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This  Technical  Report  has  been  developed  to  address  the  survivability  of 
telecommunications  networks  including  services.  It  responds  to  the  need  for  a 
common  understanding  of,  and  assessment  techniques  for  network  survivability, 
availability  integrity,  and  reliability.  It  provides  a  basis  for  designing  and 
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survivability  and  a  foundation  for  continuing  industry  activities  in  the  subject 
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and  covers  a  wide  range  of  users.  Two  frameworks  are  established  for  quantifying 
and  categorizing  service  outages,  and  for  classifying  network  survivability 
techniques  and  measures.  The  performance  of  the  network  survivability  techniques 
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13.  Abstract  (Continued 

The  intended  audience  for  this  Technical  Report  includes  providers,  users, 
and  regulators  of  telecommunications  networks  and  services.  These  include 
designers,  planners,  traffic  engineers,  and  individuals  in  charge  of  operations, 
maintenance,  management  and  administration.  They  can  use  this  report  to  quantify 
and  enhance  the  survivability  of  their  networks.  It  also  provides  a  service 
outage  framework  that  can  help  users  communicate  their  expectations  of  service  and 
their  requirements  for  network  survivability.  Equipment  vendors  can  use  it  to 
design  and  build  equipment  to  enhance  network  reliability.  The  government  may  use 
it  for  the  acquisition,  procurement,  and  the  preceding  purposes! 


A  Technical  Report  on 
Network  Survivability  Performance 


Abstract 

This  Technical  Report  provides  information  on  the  network  survivability 
performance  of  telecommunications  networks.  Although  techniques, 
parameters  and  methods  needed  to  study  network  survivability 
performance  are  defined,  recommended  parameter  objectives  are  not 
established. 
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Foreword 

This  report  addresses  the  growing  concerns  from  the  telecommunica¬ 
tions  community  about  the  survivability  of  telecommunications 
networks,  including  the  services  provided.  It  is  intended  to  provide  a 
basis  for  designing  and  operating  telecommunications  networks  to  meet 
users'  expectations  regarding  network  survivability. 

The  intended  audience  of  this  report  includes  providers,  users  and  regu¬ 
lators  of  telecommunications  networks  and  services,  as  well  as 
telecommunications  equipment  providers. 

Suggestions  for  improvement  of  this  report  are  welcome.  They  should 
be  sent  to  the  Exchange  Carriers  Standards  Association,  Suite  500,  1200 
G  Street  NW.,  Washington,  DC  20005. 
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0.  Executive  Summary 

As  a  result  of  growing  concerns  from  the  telecommunications  community,  this 
Technical  Report  has  been  developed  to  address  the  survivability  of 
telecommunications  networks,  including  the  services  provided. 

The  report  is  needed  to  provide  a  common  understanding  and  common  assessment 
techniques  for  network  survivability.  It  provides  a  basis  for  designing  and 
operating  telecommunications  networks  to  meet  users'  expectations  regarding 
network  survivability.  The  intended  audience  of  this  report  includes  providers, 
users  and  regulators  of  telecommunications  networks  and  services,  as  well  as 
telecommunications  equipment  suppliers.  The  report  also  provides  a  foundation 
for  continuing  industry  activities  in  this  area. 

Terminology  to  characterize  network  survivability  is  provided.  In  particular, 
network  survivability  is  defined  to  be;  (i)  the  ability  of  a  network  to  maintain  or 
restore  an  acceptable  level  of  performance  during  network  failures  by  applying 
various  restoration  techniques,  and  (ii)  the  mitigation  or  prevention  of  service 
outages  from  network  failures  by  applying  preventative  techniques.  Network 
survivability  includes  other  industry  terms,  such  as  “network  integrity”  and 
“network  reliability,”  and  is  related  to  “network  availability.” 

A  framework  for  quantifying  service  outages  is  developed.  The  parameters  for  this 
framework  are  the  unservability  of  services  affected  by  the  network  failure,  the 
duration  of  the  outage,  and  the  extent  of  the  failure  (e.g.,  geographical  area, 
population,  or  network).  Categories  of  service  outage  are  outlined.  The  categories 
depend  on  type  of  user,  network  and  service.  Types  of  users  include  carriers, 
residential  customers,  government  agencies,  educational  and  medical  institutions, 
as  well  as  business  and  financial  customers. 

A  four-layer  framework  is  described  for  classification  of  network  survivability 
techniques  in  telecommunication  networks.  These  layers  are  physical,  system, 
logical  and  service.  In  addition  to  providing  a  common  basis  for  describing  and 
comparing  techniques,  the  framework  identifies  layer(s)  responsible  for  reacting  to 
the  various  types  of  failures  and  their  interaction. 

Techniques  available  to  network  providers  to  enhance  the  survivability  of  their 
telecommunications  networks  at  each  layer  are  described.  Two  basic  approaches 
to  compare  survivability  techniques  and  evaluate  network  survivability 
performance  are  given.  The  first  approach  (Given  Occurrence  of  Failure,  or  GOF) 
uses  a  conditional  approach  and  defines  survivability  measures  for  a  network 
assuming  that  given  failures  have  occurred.  The  second  approach  (Random 
Occurrence  of  Failure,  or  ROF)  uses  probability  of  network  failure  and,  possibly, 
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rates  of  repair  and/or  restoration,  to  calculate  various  measures  of  network 
unservability  or  loss. 

Suggestions  are  given  for  the  general  industry.  Key  suggestions  are  outlined  here: 

•  quantify  service  outages  with  the  framework  described  herein, 

•  use  the  terminology  defined  herein  for  describing  network  survivability, 
including  network  reliability  and  network  integrity, 

•  use  the  layered  network  survivability  framework  described  herein  for 
clarifying  failure  survivability  analyses,  objectives  and  methods, 

•  plan  survivability  jointly  (e.g.,  interexchange  carrier  and  exchange  carrier 
interworking),  and 

•  use  the  performance  measures  defined  herein  to  compare  survivability 
techniques  and  to  evaluate  network  survivability  performance. 

Recommendations  are  also  given  for  future  standards  work.  Key  recommendations 
are  outlined  here: 

•  better  quantification  of  the  qualifying  regions  for  service  outage  categories, 

•  validation  of  traffic  characteristics, 

•  analysis  of  user  expectations  of  network  survivability  performance, 
planning,  engineering  arid  operations  guidelines  for  network  survivability 
performance,  and 

•  standardization  of  network  survivability  performance  measures. 

1.  Purpose,  Scope,  Application  and  Outline 

1.1  Purpose  As  a  result  of  growing  concerns  from  the  telecommunications 
community,  this  Technical  Report  has  been  developed  to  address  the  survivability 
of  telecommunications  networks,  including  the  services  provided.  The  report  is  a 
response  to  the  need  for  a  common  understanding  of,  and  assessment  techniques 
for,  network  survivability,  including  availability,  integrity,  and  reliability.  It  also 
provides  a  basis  for  designing  and  operating  telecommunications  networks  to  meet 
users  expectations  for  network  survivability.  Further  this  report  provides  a 
foundation  for  continuing  industry  activities  in  the  subject  area. 

1.2  Scope  This  Technical  Report  focuses  on  the  survivability  of  both  public 
and  private  networks  and  covers  a  wide  range  of  potential  users.  Two  frameworks 
are  established:  the  first  for  quantifying  and  categorizing  service  outages,  and  the 
second  for  classifying  network  survivability  techniques  and  measures.  The 
performance  of  network  survivability  techniques  is  considered;  however. 
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recommended  objectives  for  network  survivability  performance  are  not  established. 
Recommendations  are  made  for  continuing  work  on  this  topic. 

U  Application  The  intended  audience  of  this  Technical  Report  includes 
providers,  users,  and  regulators  of  telecommunications  networks,  and  services,  as 
well  as  suppliers  of  telecommunications  equipment.  Network  provider  personnel 
—  including  designers,  planners,  traffic  engineers,  and  individuals  in  charge  of 
operations,  maintenance,  management  and  administration  —  can  use  this  Technical 
Report  to  quantify  and  enhance  the  survivability  of  their  networks.  This  Technical 
Report  also  provides  a  service  outage,  framework  that  can  help  users  communicate 
to  die  providers  their  expectations  of  service  and  a  framework  that  allows  network 
providers  to  specify  their  requirements  for  network  survivability. 
Telecommunications  equipment  suppliers  can  use  this  Technical  Report  to  guide 
the  design  and  building  of  equipment  to  improve  network  survivability. 

1.4  Outline  Section  2  and  Appendix  A  review  related  work.  Several  aspects 
of  Operations,  Administration  and  Maintenance  (OA&M)  functions  are  covered  in 
other  Committee  T1  documents  (see  Section  2.1).  Section  3  is  the  introduction  to 
this  report.  Section  4  provides  a  'ramework  for  quantifying  and  categorizing 
service  outages.  Section  5  introduces  a  framework  for  classifying  network 
survivability  techniques  and  measures.  Network  survivability  techniques  that 
mitigate  the  impact  of  failures  or  service  degradation  are  addressed  in  Section  6. 
Definitions  and  examples  of  parameters  and  measures  to  evaluate  the  survivability 
performance  of  any  network  are  presented  in  Section  7  and  Appendix  B. 
Recommended  objectives  for  these  parameters  are  not  established.  Section  8  gives 
general  suggestions  and  Section  9  gives  recommendations  to  Committee  T1  and 
Working  Group  T1 A  1.2.  Section  10  summarizes  this  Technical  Report.  Appendix 
A  describes  the  Telecommunications  .Service  Priority  System.  Appendix  B  gives 
example  network  survivability  analyses  and  assessments.  Some  user  expectations 
for  service  outage  duration  are  given  in  Appendix  C,  and  tolerance  categories  for 
restoration  times  are  introduced  in  Appendix  D. 

2.  Related  Work 

2.1  Committee  T1  Standards  Work  In  addition  to  T1 A 1 ,  the  following  four 
other  T1  technical  subcommittees  are  involved  in  work  related  to  network 
survivability  performance:  T1E1,  T1M1,  T1S1  and  T1X1. 

T1E1  is  considering  environmental  standards.  These  standards  also  impact 
network  survivability  performance. 
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T1M1  Working  Group  2  (SONET1  OA&M)  is  considering  the  OA&M  aspects  of 
failed  networks. 

T1S1  Working  Group  3  (Signaling  System  Number  7,  or  SS7,  protocol)  is  studying 
congestion  controls  for  Common  Channel  Signaling  Networks  (CCSNs)  and  has 
produced  a  report  to  the  Signaling  Network  Systems  Focus  Team  of  the  Network 
Reliability  Council  (see  Section  2.3). 

T1X1  is  charged  with  developing  standards  for  ring  architectures,  protection 
switching  and  SONET  compatibility.  This  covers  a  range  of  survivable  network 
architectures. 

2.2  International  Standards  Work  The  following  Telecommunication 
Standardization  Sector  (ITU-TS,  formerly  CCITT)  study  groups  are  involved  in 
work  related  to  network  survivability  performance: 

Study  Group  2  addresses  the  overall  operation  of  telecommunication  networks. 

This  work  addresses  various  design  alternatives  for  survivable  networks. 

Study  Group  4  is  studying  the  restoration  of  failed  international  exchanges,  and 
transmission  systems. 

Study  Group  5  is  studying  survivability  issues  for  outside  plant. 

Study  Group  1 1  is  studying  CCSNs  and  switching  systems. 

Study  Group  13  is  looking  at  digital  network  architectures.  This  work  addresses 
rings  and  other  survivable  network  architectures. 

Study  Group  15  is  addressing  transmission  system  requirements,  including  the 
restoration  requirements  for  cross-connect  systems. 

2.3  Other  Domestic  Forums  and  Committees 

The  Industry  Carriers  Compatibility  Forum  (ICCF)  consists  of  representatives  from 
the  industry  carriers  and  major  vendors  and  deals  with  compatibility  problems 
(including  SS7)  between  carriers.  It  is  sponsored  by  the  Carrier  Liaison  Committee 
(CLC)  of  the  Exchange  Carriers  Standards  Association  (ECSA). 

The  Network  Operations  Forum  (NOF)  consists  primarily  of  representatives  from 
the  industry  carriers  Operations  departments  who  meet  periodically  to  deal  with 


1  Synchronous  Optical  NETwork. 
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operational  problems  of  mutual  interest.  It  also  falls  under  the  auspices  of  the 
CLC. 

The  Network  Reliability  Council  (also  known  as  the  NRC)  was  formed  by  the 
Federal  Communications  Commission  (FCC)  in  the  fourth  quarter  of  1991  in 
response  to  major  telecommunications  outages.  The  Network  Reliability  Council 
is  expected  to  provide  the  FCC  and  the  Telecommunications  Industry  with 
recommendations  for  prevention  of  public  telecommunications  network  outages 
and  minimization  of  the  impact  of  such  outages.  The  Network  Reliability  Council 
consists  of  the  following  seven  focus  teams:  1)  Signaling  Network  Systems,  2) 
Fiber  Cables  (with  focus  on  cable  digups),  3)  Digital  Cross-connect  Systems,  4) 
Power  Systems,  5)  Switching  (with  focus  on  software),  6)  Fire  Prevention,  and  7) 
Enhanced  91 1  Systems. 

2.4  U.S.  NSTAC  Task  Force  Work  The  following  task  forces  established  by 
the  National  Security  Telecommunications  Advisory  Committee  (NSTAC)  studied 
Network  Survivability  issues  and  forwarded  their  recommendations  to  the  U.S. 
Government: 

Telecommunications  Systems  Survivability  (TSS),  June  1988 
This  task  force  studied  the  work  of  five  previous  National  Security 
Telecommunications  Advisory  Committee  (NSTAC)  task  forces: 

1.  Commercial  Satellite  Survivability  (CSS),  June  1983  and  August  1989 

In  response  to  the  twelve  recommendations  of  this  task  force,  the 
government  developed  the  Commercial  SATCOM  Interconnectivity  (CSI) 
program  where  commercial  satellites  could  be  used  to  route  traffic  around 
damaged  areas  of  public  switched  networks. 

2.  Commercial  Network  Survivability  (CNS),  August  1985 

To  improve  the  survivability  of  public  switched  networks  when  Section  706 
of  the  Communications  Act  of  1934  is  invoked,  this  task  force 
recommended  interconnecting  the  commercial  telecommunications 
networks. 

3.  Electromagnetic  Pulse  (EMP),  July  1985 

As  a  result  of  this  task  force’s  recommendations,  the  government  and 
industry  is  addressing  EMP  in  T1E1. 

4.  National  Coordinating  Mechanism  (NCM),  December  1983 

This  task  force  called  for  the  establishment  of  the  National  Coordination 
Center  (NCC):  a  joint  industry-government  operation  to  initiate,  coordinate, 
restore,  and  reconstitute  public  switched  network  services  under  all 
conditions  of  crisis  or  emergency. 
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5.  Automated  Information  Processing  (AIP),  October  1985 

This  task  force  addressed  the  survivability  and  security  of  automated 
information  processes  within  public  switched  networks. 

Telecommunications  Industry  Mobilization  (TIM),  April  1989 
This  task  force  examined  the  telecommunications  industry’s  ability  to  rapidly  and 
effectively  “marshal  their  telecom  resources  needed  to  transition  from  a  normal 
state  to  a  state  of  readiness  for  war  or  other  national  emergency.” 

Telecommunication  Service  Priority  (TSP),  September  1990 
This  task  force  developed  a  system  for  priority  provisioning  and  restoration  of 
critical  federal,  state,  local,  and  foreign  government  communications  services.  It 
replaces  the  existing  Restoration  Priority  (RP)  System  and  revokes  the  public 
network  precedence  system.  A  detailed  discussion  of  the  TSP  system  is  in 
Appendix  A. 

National  Research  Council  Report  Task  Force,  March  1990 
This  task  force  reviewed  the  National  Research  Council’s  1989  report  “Growing 
Vulnerability  of  the  Public  Switched  Networks:  Implications  for  National  Security 
Emergency  Preparedness”  and  concluded  that  public  switched  networks  of  the 
1990s  are  growing  more  survivable  because  of:  (i)  the  diversity  provided  by  the 
major  interexchange  carriers,  (ii)  the  demand  by  network  users  for  very  high 
availability  of  network  services,  and  (iii)  the  development  of  increasingly  robust 
network  architectures  that  incorporate  major  advances  in  transmission,  switching, 
and  signaling  technologies. 

Intelligent  Network  (IN),  November  1990 

This  task  force  investigated  the  evolving  capabilities  of  the  IN  to  provide 
customized  software-controlled  network  services  and  examined  the  vulnerability 
and  interoperability  issues  associated  with  this  emerging  technology. 

Energy  Task  Force,  February  1990 

This  task  force  recommended  the  government  establish  a  priority  for  electric  power 
restoration  and  fuel  distribution  to  public  switched  network  telecommunications 
companies  during  emergency  conditions. 

3.  Introduction 

Historically,  the  designers  and  operators  of  telecommunications  networks  have 
placed  a  high  emphasis  on  overall  network  performance  quality  and  continuity  of 
the  services  they  provide.  This  has  been  termed  in  the  industry  as  network 
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Due  to  the  nature  of  business  today,  a  significant  number  of  end  users  have  an 
absolute  dependency  on  telecommunications.  Advancing  technology  (in  both 
hardware  and  software),  increased  centralized  control  and  services  evolution  have 
resulted  in  increased  complexity,  and  capacity  concentration.  Users  of  these 
networks  and  services  have  also  raised  their  requirements  and  expectations.  The 
demand  for  high-quality  performance,  assured  service  continuity,  and  transparency 
to  failures  has  never  been  greater.  Users  who  require  enhanced  levels  of  network 
survivability  performance  include  not  only  government,  military,  and  emergency 
organizations,  but  the  general  public  as  well.  This  is  termed  by  the  users  to  be 
network  reliability.  The  term  network  integrity  is  also  sometimes  used  in  this 
context.  The  use  of  the  term  “reliability”  in  this  network  context  should  not  be 
confused  with  the  more  technical  use  of  the  term  in  the  context  of  the  ability  of  a 
specific  component  or  equipment  to  perform  its  intended  function  over  a  specified 
period  of  time. 

Along  with  the  high  concentration  of  telecommunications  facilities  and  services 
brought  about  by  technology,  there  have  been  continuing  occurrences  of  service 
outages  caused  by  metallic  and  fiber  cable  cuts,  fires,  natural  disasters,  human 
errors,  software  faults,  sabotage,  and  malicious  damage.  The  potential  impact  of  a 
single  network  failure  is  higher  in  magnitude,  geographic  scope,  and  public 
perception  than  ever  before.  This  potential  impact  heightens  the  need  for  network 
survivability:  network  survivability  has  two  components:  (i)  the  ability  of  a 
network  to  maintain  or  restore  an  acceptable  level  of  performance  during  network 
failures  by  applying  various  restoration  techniques,  and  (ii)  the  mitigation  or 
prevention  of  service  outages  from  network  failures  by  applying  preventative 
techniques.  Preventative  techniques  can  reduce  the  need  for  restoration  techniques. 

Depending  on  the  type  and  scope  of  the  failures,  which  can  range  from  affecting 
very  large  geographic  areas  to  a  single  customer,  service  restoration  times  can 
range  from  many  days  to  milliseconds. 

The  type  and  scope  of  service  outages,  occurring  from  both  natural  and  human- 
caused  network  failures,  are  categorized  in  Section  4  as  being  one  of  catastrophic, 
major  or  minor.  These  categories  are  based  on  the  impact  of  the  failure  on  users. 
For  minor  failures,  network  restoration  can  be  automatic  and  virtually  transparent 
to  the  user.  However,  for  catastrophic  and  major  events  affecting  large 
populations,  some  service  may  be  restored  automatically,  but  most  service  will  be 
restored  manually  and  as  quickly  as  possible,  based  upon  availability  of  personnel 
and  replacement  facilities. 

A  framework  for  quantifying  and  categorizing  service  outages  in  specific  terms  has 
been  developed  in  Section  4.  The  parameters  of  this  framework  are:  (i)  the 
Unservability  (U)  of  some  or  all  of  the  provided  service  affected  by  the  failure;  (ii) 
the  Duration  (D)  during  which  the  service  outage  exists,  and  (iii)  the  Extent  (E)  in 
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terms  of  the  geographical  area,  population  and  traffic  volumes  affected  by  the 
network  failure.  Throughout  this  report,  this  framework  will  be  referred  to  as  the 
(U,  D,  E)  triple. 

To  lessen  the  occurrence  and  the  degree  of  such  service  outages,  survivability 
techniques  have  included  redundancy  of  networks  and  systems,  provisioning  of 
highest  reliability  systems,  good  preventative  maintenance,  monitoring  and 
surveillance,  manual  response,  and  optimized  restoration  and  repair  techniques. 

Survivability  techniques  are  being  developed,  based  on  intelligent  and  efficient 
technology,  that  provide  the  potential  and  capability  to  develop  new  and 
preemptive  techniques  for  network  survivability.  Innovative  developments  in  all 
layers  and  elements  of  networks  are  making  it  technically  and  practically  possible 
to  cope  with  many  types  of  network  failures.  When  this  is  coupled  with  well- 
planned  reactive  techniques  such  as  readiness  and  contingency  planning  (Disaster 
Recovery),  the  structure  is  in  place  for  the  capability  to  provide  a  full  spectrum  of 
network  survivability.  The  information  contained  within  this  Technical  Report  can 
help  to  provide  a  focus  and  impetus  for  ensuring  the  survivability  performance  of 
networks  and  services  in  general,  as  well  as  a  base  for  enhanced  network 
survivability  techniques  for  special  applications  needs. 

3.1  User  Categories  This  section  lists  a  range  of  telecommunications  network 
users,  and  groups  them  into  categories  based  on  broad  functional  or  service  related 
characteristics. 

User  concerns  have  been  heightened  by  telecommunications  failures  that  have 
dramatized  the  vulnerability  of  telecommunications  networks  to  fires,  viruses, 
software  errors,  optical  fiber/metallic  cable  cuts,  power  failures  and  other  natural  or 
human-caused  accidents. 

Users  have  unique  requirements  and  expectations  for  uninterrupted  service, 
depending  on  the  user  type,  service  value  and  subscription  cost.  To  meet  these 
user  expectations,  service  providers  may  make  use  of:  specific  network 
configurations,  dynamic  routing,  restoration  techniques  or  procedures, 
network/system  hardening,  prevention  of  unauthorized  access,  and  emerging 
technologies.  Specific  user  groups  with  similar  or  unique  expectations  include: 

•  Interexchange  Carriers  (IC), 

•  Exchange  Carriers  (EC), 

•  residential  customers  (including  Plain  Old  Telephone  Service,  or  POTS), 

•  government  agencies  (federal,  state  and  municipal), 

•  educational  institutions, 

•  business  customers,  (excluding  some  financial  institutions) 

•  cellular  carriers. 
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•  information  service  providers, 

•  other  access  providers, 

•  financial  institutions,  and 

•  emergency  organizations. 

Note  that  many  service  providers  are  also  considered  to  be  network  users.  Users 
need  to  identify  their  survivability  expectations.  An  example  of  the  U.S. 
Government  user  expectations  is  discussed  in  Appendix  C. 

As  an  example  of  a  user  network,  a  public  switched  network  illustrated  in  Figure 
3.1  is  considered  to  be  divided  into  IC  and  EC  portions2.  In  this  example,  an  IC  is 
a  user  of  two  ECs’  access  services. 


Figure  3.1:  IC  and  EC  Portions  of  a  Connection  in  a  Public  Switched  Network 

4.  Framework  for  Quantifying  and  Categorizing  Service  Outages 

This  section  introduces  a  general  framework  for  quantifying  service  outages  from  a 
users'  perspective,  applying  it  to  three  examples:  circuit  switched,  packet  switched 
and  leased  line  networks.  Before  describing  this  framework,  the  term  service 
outage  is  defined  as  follows: 

A  service  outage  is  the  state  of  a  service  when  network  failure(s)  impair  the 
initiation  of  new  requests  for  service  and/or  the  continued  use  of  the  service  and 


2  Cellular  carriers  arc  considered  to  be  both  1C  and  EC  carriers. 
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where  the  service  outage  parameters  defined  below  do  not  fall  into  the  “no 
outage”  qualifying  region. 

4.1  Service  Outage  Parameters  The  three  parameters  of  the  framework  are 
introduced  below: 

Unservability  (U)  A  major  impact  of  a  service  outage  is  service  disruption  to  all 
or  a  portion  of  users.  Unservability  is  defined  in  terms  of  a  unit  of  usage.  For 
example: 

•  In  a  circuit  switched  network,  the  most  common  function  of  the  network  is 
the  ability  to  establish  or  maintain  a  connection  from  a  source  to  a 
destination  within  the  limits  of  engineered  blocking  and  transmission 
performance.  The  unit  of  usage  is  a  call  attempt  or  call.  In  this  instance, 
unservability  is  the  percentage  of  units  that  fail. 

•  In  a  packet  network,  the  unit  of  usage  is  a  packet.  The  unservability  is 
defined  as  the  percentage  of  packets  that  were  not  delivered  within  the 
design  delay. 

•  In  a  leased  line  network,  unservability  is  defined  as  the  percentage  of  service 
units  failed  (e.g.,  DS-O,  DS-1  or  DS-3  service  units). 

Duration  (D)  This  parameter  defines  the  time  during  which  the  service  is 
unservable.  It  is  measured  by  determining  the  beginning  and  the  end  points  of  a 
network  failure,  based  on  the  performance  exceeding  the  unservability  threshold. 

Extent  (E)  This  parameter  includes  the  context  of  geographic  area,  population 
affected,  traffic  volumes,  and  customer  traffic  patterns,  in  which  the  unservability 
exceeds  a  given  threshold. 

One  can  categorize  service  outages  by  different  sets  of  values  for  which  the  (U,  D, 
E)  triple  qualifies  for  the  particular  category  of  outage.  If  one  were  to  graph  the 
(U,  D,  E)  triple  in  3-space,  then  the  values  for  a  particular  outage  category  would 
form  a  region,  which  we  define  as  the  qualifying  region.  For  example,  Figure  4.1 
illustrates  three  simple  qualifying  regions.  The  (U,  D,  E)  triple  is  the  basis  for 
quantifying  the  impact  of  service  outages  on  users.  Service  outages  may  be 
classified  as  catastrophic,  major,  or  minor,  depending  on  whether  the  values  of  the 
(U,  D,  E)  triple  fall  in  the  appropriate  qualifying  region.  The  different  sets  of 
values  for  the  (U,  D,  E)  triple  that  define  each  of  these  categories  are  under  study. 
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Figure  4.1:  Example  Qualifying  Regions  for  (U,  D,  E) 

Further  investigation  is  needed  to  determine  the  impacts,  if  any,  on  the 
infrastructure  of  network  elements,  operations  support  systems  and  interfaces  for 
measuring,  collecting,  and  analyzing  information  to  support  the  U,  D  and  E 
parameters.  If  the  network  infrastructure  is  not  initially  able  to  provide  the 
information,  investigation  into  the  identification  of  interim  procedures  may  be 
needed. 

4.2  Service  Outage  Categories  This  section  generically  defines  and  describes 
the  service  outage  categories  based  upon  the  above  framework. 

Depending  on  the  combination  of  a  failure's  U,  D  and  E,  the  following  three 
distinct  service  outage  categories  can  be  defined  for  both  natural  and  human- 
caused  failures  (in  order  of  decreasing  severity): 

Catastrophic  Possible  causes:  severe  network  node  failure(s)3,  caused  by 
earthquake,  flood,  hurricane,  tornado,  power  company  grid 
failure,  global  hardware/software  defect,  act  of  war,  etc. 

Major  Possible  causes:  toll  center,  common  channel  signaling,  network 

node  failure  caused  by  fire,  terrorism,  failure  of  cable  backbone 


3  Examples  include  a  critical  Central  Office  (CO)  failure,  one  or  more  STP  pair  failures,  multiple  EO,  AT,  or 
MTSO  failures. 
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route.  Digital  Cross-connect  System  (DCS)4  failure,  access 
tandem  switch  failure,  CO  power/battery  failure,  software 
defects,  fiber  cable  failure,  etc. 

Minor  Possible  causes:  single  fiber  failure,  equipment  failure  (including 

shelf  or  unit). 

The  goal  of  a  service  provider  is  to  restore  interrupted  service  as  quickly  as 
possible.  For  catastrophic  outages,  affecting  large  populations,  some  service  may 
be  restored  automatically,  but  most  service  will  be  restored  manually.  For  major 
outages,  a  larger  percentage  of  service  may  be  restored  automatically  and  the 
remainder  of  the  service  would  be  restored  manually.  For  minor  outages,  service 
restoration  can  be  automatic. 

It  also  should  be  recognized  that  the  users'  expectations  of  uninterrupted  service  is 
tempered  by  the  category  and  type  of  service  outage.  For  example,  in  a 
catastrophic  outage  caused  by  a  hurricane  and  affecting  a  very  large  number  of 
users  in  one  area,  most  users  may  recognize  that  service  will  be  restored  as  quickly 
as  is  humanly  possible.  The  time  for  total  restoration  of  service  for  every  user  may 
take  days  to  weeks.  Future  standards  activity  should  specify  quantifications  of 
these  service  outage  qualifying  regions  for  particular  services,  networks,  and  users. 

Considering  the  above  framework  for  measuring  service  outages.  Appendix  C 
discusses  users'  expectations. 

5.  Framework  for  Classifying  Network  Survivability  Techniques  and 
Measures 

This  section  discusses  a  four-layer  framework  (see  Table  5.1)  that  allows  for  the 
classification  of  network  survivability  techniques  in  telecommunication  networks5. 
Each  layer  can  be  characterized  by  a  collection  of  networks  (or  subnetworks).  Each 
of  the  networks  consists  of  nodes,  links,  traffic  or  demand6,  and  control  mechanisms 
and  control  networks.  The  links  provide  capacity  to  carry  the  traffic  and  the  nodes 
provide  points  of  access,  switching,  or  routing  for  the  traffic.  The  control 
mechanisms  for  each  of  these  networks  can 


4  The  term  EDSX  (Electronic  Digital  Signal  Cross-conncct)  is  often  used  fora  DCS  whose  interface  rate  equals  its 
cross-connect  rale,  c.g.,  a  DCS-3/3  that  cross-connects  DS-3s.  This  distinction  is  not  used  here. 

5  The  network  layers  defined  here  should  not  be  confused  with  layers  defined  in  other  more  restricted  contexts,  such 
as  the  OS1  layers  for  packet  networks  (sec  Section  6.4.2). 

6  Throughout  the  rest  of  this  document  the  terms  "traffic"  and  "demand"  are  used  interchangeably. 
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Table  5.1:  Network  Survivability  Layers 
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provide  the  functions  of  signaling,  traffic  routing  control,  traffic  admissibility 
control,  and  data  communications.  The  control  mechanisms  also  use  a  control  data 
communications  network;  for  communication  among  control  logic  points  and 
nodes.  The  control  data  communications  network  may  either  be  separate  (out-of- 
band)  or  coincident  (in-band)  with  the  traffic  bearing  network.  The  survivability  of 
the  control  data  communications  network  itself  also  should  be  studied  and 
evaluated. 

In  this  representation,  the  link  capacity  of  the  service  layer  in  turn  forms  the  traffic 
for  the  logical  layer,  whose  link  capacity  in  turn  forms  traffic  for  the  system  layer, 
etc.  Nodes  and  links  of  the  system  and  logical  layers  represent  transmission 
equipment  and  provide  the  bandwidth  required  for  the  service  layer  links.  Service 
layer  nodes  represent  the  switching  or  processing  equipment  that  uses  this 
bandwidth  to  provide  user  service  networks.7 

Network  survivability  techniques  can  be  deployed  at  various  layers  of  this 
framework  to  ensure  that  failures  interrupt  end-users  as  little  as  possible.  Failures 
within  a  layer  can  be  guarded  against  by  techniques  either  in  that  layer  or  at  a 
higher  layer8 .  Beginning  with  the  physical  layer,  the  function  of  each  layer  is 
described  below.  Sample  network  survivability  techniques  representing  network 
based  solutions  for  link  and  node  failures  for  each  layer  are  described  in  Section  6. 

5.1  Physical  Layer  The  physical  layer  can  be  viewed  as  having  two  sublayers: 
the  geographical  sublayer  and  the  media  sublayer.  The  geographical  sublayer 
consists  of  the  assets  that  house  the  media  and  other  equipment  associated  with  the 
upper  layers,  e.g.,  buildings,  conduits,  manholes  and  rights-of-way.  The  media 
sublayer  contains  the  transmission  media  used  by  the  transmission  systems,  e.g., 
fiber  cable,  copper  cable  and  radio  spectra. 

Figure  5.1  illustrates  an  example  physical  layer  subnetwork.  The  links  of  this 
example  network  represent  fiber  cables  (media)  that  in  turn  route  over  conduit 
sections  (not  shown)  between  the  buildings  or  possible  repeater  site  locations. 


7  The  service,  logical,  system  and  physical  layers  defined  here  correspond  roughly  to  the  circuit,  path,  section  and 
physical  media  layers,  respectively,  used  in  CCITT  G.803. 

8  For  example,  a  cable  cut  may  be  repaired  or  restored  in  the  physical,  system  or  logical  layers.  Otherwise,  it  can  be 
restored  by  the  service  layer,  in  contrast,  a  switch  failure  at  the  service  layer  can  only  be  restored  in  the  service 
layer. 
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Figure  5.1:  Example  Physical  Layer  Subnetwork 

5.2  System  Layer  The  system  layer  represents  the  network  transmission 
systems,  and  the  system  terminating  and  full-rate  interface  equipment.  Network 
transmission  systems  include  both  cable-based  (fiber  and  metallic)  and  wireless 
technologies  (e.g.,  cellular.  Personal  Communications  Network  (PCN)  and  radio). 
Network  survivability  techniques  at  the  system  layer  reconfigure  the  network 
components  to  protect  the  logical  layer,  and  thus  die  service  layer,  against  physical 
or  system  layer  failures.  These  techniques  combine  system  diversity  with 
geographical  diversity  and  reserve  capacity  to  allow  bandwidth  lost  by  failures  to 
be  restored. 

Figure  5.2  illustrates  a  subnetwork  at  the  system  layer.  It  consists  of  a  SONET 
[44,38]  OC-48  (2.448  gigabits/second)  self-healing  ring  [40,42]  and  two  SONET 
point-to-point  OC-12  (622  megabits/second)  fiber  systems.  The  capacity  over  the 
links  of  this  network  becomes  demand  (traffic)  for  die  physical  layer  example 
network  shown  in  Figure  5.1 .  For  example,  the  B-E  OC-12  fiber  system  shown  in 
Figure  5.2  routes  over  two  fiber  cable  links  (node  pairs  B-D  and  D-E)  in  the 
physical  layer  example  of  Figure  5.1. 

A  B 


OC-12 

ADM  s  Add/Drop  Multiplexer 
MUX  s  Terminal  Multiplexer 
OC-n  s  SONET  Optical  Carrier  n-rate 
_  _  (n-rate  ie  nx51.84 

D  E  megabit e/eecond) 

Figure  5.2:  Example  System  Layer  Subnetwork 

5.3  Logical  Layer  The  next  layer  of  the  planning  framework  is  the  logical 
layer,  in  which  lower-rate  transmission  channels  such  as  DS-Os,  DS-ls  or  DS-3s 
and  their  interface  equipment  appear.  Using  the  full-rate  transmission  channels  of 
the  system  layer,  the  logical  layer  provides  channels  for  use  by  the  service 
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networks  in  the  service  layer.  After  a  failure  is  detected  by  the  logical  layer, 
network  survivability  techniques  act  to  preserve  the  logical  network  that  the  service 
layer  uses. 

There  are  many  possible  distinctions  between  the  traffic  characteristics  of  a 
network  at  the  logical  layer  and  a  network  at  the  service  layer.  Traffic  holding 
times  are  one  such  distinction.  For  example,  traffic  for  a  wideband  DCS  network 
(a  wideband  DCS  interfaces  digital  signals  at  the  DS-3,  or  STS-n  rate,  and  cross  - 
connects9  the  constituent  channels  at  the  DS-1  or  SONET  Virtual  Tributaiy  (VT) 
group  rate;  a  broadband  DCS  interfaces  digital  signals  at  the  DS-3  or  STS-n  rate 
and  cross-connects  the  constituent  channels  at  the  DS-3  or  STS-1  rate)  at  the 
logical  layer  are  DS-ls  that  represent  private  line  demand  or  multiplexed  trunks  of 
a  circuit  switched  network.  Here,  traffic  holding  time  represents  the  time  between 
connection  and  disconnection  of  a  DS-1,  which  could  range  from  days  to  years. 
Also,  this  traffic  does  not  have  easy  re-connect  (“redial”)  capability  as  in  a  circuit 
switched  network.  In  contrast,  the  traffic  in  a  circuit  switched  network  at  the 
service  layer  is  calls,  whose  holding  times  are  typically  short  -  in  the  range  of 
minutes. 

The  example  in  Figure  5.3  shows  a  logical  layer  subnetwork  of  wideband  DCSs. 
Here,  the  links  represent  families  of  DS-3  or  SONET  STS-1  capacity  between  the 
DCSs.  These  capacities  become  demand  (traffic)  for  the  example  system  layer 
subnetwork  in  Section  5.2.  For  example,  the  DS-3s  between  nodes  A  and  B  route 
over  nodes  A-C-B  of  the  SONET  self-healing  ring  in  the  system  layer  example 
subnetwork.  Note,  although  not  illustrated  in  this  simple  example,  there  can  be 
further  layering  of  subnetworks  within  the  logical  layer  (e.g.,  the  link  capacities  of 
a  subnetwork  of  wideband  DCSs  may  route  over  a  subnetwoik  of  broadband 
DCSs). 


9  Cross-connecting  is  the  process  of  connecting  between  terminal  blocks  on  the  two  sides  of  a  distribution  frame,  or 
between  terminals  on  a  terminal  block.  The  same  connection  is  also  made  through  software  implementations  of  this 
process  by  a  DCS. 
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5.4  Service  Layer  The  top  layer  of  the  planning  framework  is  the  service 
layer.  This  layer  contains  user  service  networks  such  as  public  and  private  data, 
voice  and  CCSNs.  Survivability  techniques  offered  in  this  layer  can  include: 

•  size  limits, 

•  dynamic  routing, 

•  message  retransmission, 

•  multi-serving,  and 

•  reconfiguration  (reconnection  and  redirection). 

Service  layer  rerouting  and  reconfiguration  techniques  would  normally  be  activated 
for  capacity  that  is  not  restored  by  techniques  invoked  at  the  system  or  logical 
layers.  Service  layer  techniques  form  the  last  safeguards  before  failures  become 
apparent  to  the  user. 

Figure  5.4  illustrates  an  example  service  layer  subnetwork.  This  represents  a 
portion  of  a  circuit-switched  network.  Here,  the  links  represent  trunk  groups  (with 
integrated  DS-1  interfaces)  between  circuit  switching  systems.  These  capacities 
become  demand  (traffic)  for  the  example  logical  layer  subnetwork  in  Figure  5.3. 
For  example,  the  direct  trunks  between  switches  A  and  D  form  DS-ls  that  route 
over  nodes  A-B-D  in  the  logical  layer.  The  demand  (traffic)  transported  by  this 
example  service  layer  subnetwork  are  normal  “telephone”  calls,  often  measured  as 
“traffic  load,”  (in  units  of  Erlangs  or  hundred  call  seconds). 


D 


SW=  Switch 


Figure  5.4:  Example  Service  Layer  Subnetwork 

5.5  Summary  The  increased  impact  of  failures  has  led  to  a  greater  awareness 
of  the  need  to  design  survivable  networks.  There  is  a  need  for  a  common 
framework  and  terminology  for  discussing  and  comparing  the  wide  variety  of 
network  survivability  techniques  that  are  available.  The  four-layer  model 
described  above  provides  such  a  framework.  Table  5.1  summarizes  this  framework 
and  lists  some  of  the  techniques  that  can  be  applied  at  each  layer  to  minimize 
service  interruptions  to  users. 

This  model  provides  a  method  of  categorizing  the  various  network  survivability 
techniques  in  terms  of  the  layers  at  which  a  network  responds  to  failure.  In 
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addition  to  providing  a  common  basis  for  describing  and  comparing  techniques, 
this  framework  identifies  the  layer(s)  responsible  for  reacting  to  the  various  types 
of  failures  and  their  interaction.  Failures  within  a  layer  can  be  guarded  against  by 
techniques  either  in  that  layer  or  at  a  higher  layer.  To  protect  the  service  users, 
network  planners  need  to  select  effective  combinations  of  network  survivability 
techniques  across  the  layers  for  incremental  deployment. 

6.  Network  Survivability  Techniques 

This  section  contains  short  summaries  of  various  techniques  that  provide  for 
survivability  of  telecommunications  networks.  These  techniques  are  generic  and 
are  based  on  the  four  network  layers  described  in  the  framework  of  Section  5. 

Basic  Steps  for  Restoration 

As  illustrated  in  Figure  6.1,  network  restoration  can  be  accomplished  through  six 
basic  steps  in  response  to  a  network  failure  within  the  four  network  layers: 

1.  Detection :  Detection  of  the  network  failure  by  the  appropriate  Operations 
Support  System  (OSS),  controller,  or  network  element. 

2.  Notification :  Notification  of  the  network  failure  through  the  control 
architecture.  Under  a  centralized  scheme  this  means  notifying  the  central 
operations  support  system  or  controller  of  the  failure,  either  from  the 
network  element  directly  or  via  another  operations  support  system  for 
network  surveillance.  Under  distributed  control;  this  means  notifying  the 
other  network  elements  via  the  links  connecting  network  elements  or  a 
separate  data  communications  network. 

After  the  notification  step,  two  parallel  processes  begin:  (i)  repair  of  the  failed 
components,  and  (ii)  reconfiguration  of  affected  traffic.  These  steps  can  be  applied 
to  implementing  most  network  survivability  techniques,  although  some  of  the  steps 
may  not  be  necessary.  Also  note  that. some  combinations  of  the  steps  may  be 
iterated.  For  example,  the  path  selection  and  rerouting  steps  may  be  iterated  for 
each  demand  that  is  rerouted. 

3a.  Fault  Isolation :  Under  the  repair  process,  (see  Figure  6.1),  fault  isolation  is 
the  identification  or  location  of  the  network  failure. 

3b.  Identification:  Under  the  reconfiguration  process,  the  affected  demands  are 
identified  for  rerouting. 

4a.  Repair:  The  repair  process  is  the  effort  to  repair  the  network  failure  to 
allow  normalization  of  the  network. 

4b.  Path  Selection:  Path  selection  is  the  selecting  of  an  alternate  path  for  each 
demand  to  be  rerouted,  either  through  data  look-up  or  algorithm. 
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5.  Rerouting :  Rerouting  is  the  moving  of  the  affected  demands  to  an  alternate 
path. 

6.  Normalization :  Normalization  is  detection  of  the  repair  of  the  network 
failure  and  the  return  to  the  normal  state  of  the  network.  This  normalization 
could  include  rerouting  of  affected  demands  to  new  or  original  routes. 


Figure  6.1:  Basic  Steps  for  Restoration 


Determination  of  alternate  paths  in  the  path  selection  step  can  be  of  two  types: 
precalculated  or  determined  dynamically  based  on  path  status.  A  precalculated 
method  of  alternate  path  selection  specifies  that  for  each  demand  that  must  be 
rerouted,  a  preplanned,  reserved  alternate  path  with  spare  capacity  is  identified. 

For  example,  self-healing  rings  (Section  6.2)  use  precalculated  methods.  Status 
based  methods  respond  using  current  network  status  knowledge  to  select  alternate 
paths.  For  example,  DCS  reconfiguration  methods  (Section  6.3)  often  use  status- 
based  methods. 

The  two  general  approaches  for  status  based  path  selection  methods  are  preplanned 
lists  of  paths  stored  as  local  routing  tables  and  dynamic  path  generation,  where  the 
paths  are  created  at  the  time  of  failure. 

The  following  network  survivability  techniques  for  each  layer  may  utilize  paths 
established  via  either  wireline  (cable)  or  wireless  technologies.  Wireline 
technologies  consist  of  copper  cable  (coaxial  or  twisted  pair)  and  fiber  optics. 
Wireless  technologies  include  terrestrial  radio,  satellite,  cellular,  and  the  evolving 
PCNs.  Each  of  these  technologies  brings  to  the  network  survivability  planning 
process  its  own  limitations.  Recognizing  that  a  limited  capacity  system  may  not  be 


19 


Technical  Report  No.  24 

adequate  by  itself  to  fully  restore  disrupted  services,  when  used  in  conjunction  with 
other  surviving  technologies,  each  could  play  an  important  role  in  minimizing 
disruptions  caused  by  network  failure. 

6.1  Physical  Layer  Physical  layer  survivability  falls  into  three  categories:  (i) 
geographical  diversity  and  flexibility,'  (ii)  security  to  human-caused  intrusion,  and 
(iii)  tolerance,  i.e.,  the  ability  of  components  to  resist  human-caused  or  natural 
disasters. 

Geographical  diversity  is  a  technique  whereby  pairs  of  buildings  are  connected  via 
multiple  paths  that  do  not  share  the  same  locations  or  rights-of-way  and  where 
critical  network  systems  are  spread  among  different  locations.  To  enhance 
network  survivability,  network  and  service  providers  need  to  ensure  sufficient 
redundant  and  diverse  facilities  of  technologically  similar  media  are  in  place  to 
transport  traffic  and  maintain  uninterrupted  services  should  the  primary  media 
become  damaged.  Where  technologically  similar  media  are  not  available,  the 
restoration  links  need  to  be  equipped  to  provide  the  same  level  of  transport  as  the 
primary. 

Given  the  higher  bit  rate  and  increased  capacity  of  optical  fiber  cables,  restoration 
of  traffic  transported  via  this  medium  must  be  moved  to  other  facilities  with  the 
same  capacity.  This  suggests  that  the  fiber  diversity  route  must  be  technologically 
similar  and  of  equal  capacity  to  restore  the  affected  demands  while  not  overtaxing 
the  unaffected  media.  It  also  suggests  that  the  restoration  routes  could  be  a 
combination  of  technologically  similar  media  which,  when  combined,  has 
sufficient  capacity  to  restore  the  affected  demands. 

Security  of  the  physical  layer  should  be  a  major  point  of  consideration.  Increasing 
and  maintaining  a  high  level  of  physical  security  is  necessary  to  ensure  protection 
from  damage  caused  by  persons  intent  on  disrupting  telecommunications  services. 

Physical  layer  techniques  which  enhance  building  and  telecommunication  systems 
design  availability  do  so  by  increasing  their  ability  to  tolerate  external  and 
environmental  effects.  For  example,  good  fire  detection  and  suppression  systems 
are  well-known  methods  to  limit  the  degree  of  damage  suffered  by  network  assets. 

Standards  which  address  some  of  these  areas  have  been  developed  or  are  under 
development  in  T1  El : 

•  earthquake  protection  and  testing  (project  approved), 

•  central  office  telecommunications  equipment  ambient  temperature  and 
humidity  (ANSI  T1 .304-1 989), 

•  DC  power  systems  environment  protection  (ANSI  T1.31 1-1991), 
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•  central  office  equipment  electrostatic  discharge  (ANSI  T1 .308- 1991), 

•  electrical  protection  for  central  offices  and  facilities  (ANSI  T1 .313-1991), 
and 

•  telecommunications  equipment  ignitability  requirements 
(ANSI  T1 .307 -1990). 

6.2  System  Layer  System  layer  survivability  is  enhanced  by  considering 
techniques  which  support  reliable  connectivity  over  existing  surviving  links.  These 
include  automatic  cutover  systems,  link  diversity,  spare  or  excess  capacity,  coding, 
spread  spectrum  transmission,  and  error  correction  in  addition  to  the  following. 

6.2.1  Point-to-Point  Systems  with  Automatic  Protection  Switching 
Automatic  Protection  Switching  (APS)  is  a  transport  network  survivability 
technique  where  both  ends  of  a  working  channel's  termination  equipment  switch  to 
a  full  bandwidth  protection  channel  in  the  event  of  link  failure  (see  Figure  6.2). 

APS  can  either  have  one  protection  channel10  for  each  working  channel  (1+1  or 
1 :1),  or  have  one  protection  channel  for  one  or  more  ( n )  working  channels  (1  :n). 
The  working  and  protection  channels  may  make  use  of  diverse  routing  for  added 
network  survivability.  A  point-to-point  system  with  diverse  routing  does  not 
provide  individual  channel  add-drop  capability  at  intermediate  locations  and  hence 
another  protection  architecture  must  be  used  to  provide  this  type  of 
app!ication.[27]. 


STE  A  thru-CO  STE  Z  STE  A  thru-CO  STE  Z 


CO  «  Central  Office 

Diverse  Protection  Non-Diverse  Protection 

Figure  6.2:  Diverse  vs.  Non-Diverse  Protection 

6.2.2  Rings  A  ring  is  a  collection  of  nodes  forming  a  closed  loop  whereby  each 
node  is  connected  to  two  adjacent  nodes  via  a  duplex  communications  facility.  A 
ring  provides  redundant  bandwidth  and/or  network  equipment  so  disrupted  services 
can  be  automatically  restored  following  network  failures. 


10  1+1  systems  transmit  on  the  protection  channel  at  all  limes.  1:1  systems  transmit  on  the  protection  channel  only 
after  the  failure  (the  protection  channel  is  normally  idle). 
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Rings  can  be  divided  into  two  general  categories,  unidirectional  and  bidirectional, 
according  to  the  direction  of  traffic  flow  under  normal  conditions.  In  the  SONET 
application  [40,42],  unidirectional  and  bidirectional  architectures  are  defined  based 
on  the  direction  in  which  incoming  and  associated  returning  tributaries  travel 
around  the  ring.  An  incoming  tributary  is  the  signal  added  into  the  ring,  and  the 
returning  tributary  is  the  associated  signal  dropped  from  the  ring. 

6.2.2.1  Unidirectional  Ring  In  a  Unidirectional  Ring  (UR),  working  traffic 
is  carried  around  the  ring  in  one  direction  only  (e.g.,  clockwise).  Refer  to  Figure 
6.3.  Traffic  from  any  node  A  to  any  other  node  B  is  routed  along  the  working 
communications  ring  from  A  to  B  via  D  and  C,  and  the  return  traffic  continues 
around  the  ring  from  B  back  to  A,  in  the  same  direction  of  transport  using  the 
remaining  portion  of  the  working  communications  ring.  Thus,  traffic  arrives  at 
nodes  A  and  B  by  different  routes.  Because  the  transmission  of  normal  working 
traffic  on  the  UR  is  in  one  direction  only,  ring  capacity  is  determined  by  the  sum  of 
demands  between  each  pair  of  nodes.  URs  are  sometimes  called  “counter-rotating 
rings”  because  the  second  communications  ring  (for  protection  only)  transmits  in 
the  opposing  direction  as  the  first.  This  is  also  known  as  1 :1  protection,  or  1 : 1  UR. 


— ►  Working 
Protection 


Figure  6.3:  Unidirectional  Ring 

6.2.2.2  Bidirectional  Ring  In  a  Bidirectional  Ring  (BR),  working  traffic 
travels  in  both  directions  over  a  single  path  that  uses  the  two  parallel 
communications  paths  (operating  in  opposite  directions)  between  the  nodes  of  the 
ring  (e.g.,  between  A  and  B),  hence  the  name  BR.  Since  traffic  is  routed  over  a 
single  path  between  nodes,  spare  capacity  around  the  ring  can  be  shared  on  a  per 
span  basis  (i.e.,  demand  carried  per  span),  and  not  dedicated  to  the  total  demand  on 
the  ring  as  for  a  UR.  In  a  BR,  traffic  is  often  routed  over  the  shortest  path  and  is 
identified  during  provisioning.  This  is  referred  to  as  a  shortest  path  algorithm  or 
nondemand  splitting.  (For  optimal  load  balancing  algorithms,  however,  some 
traffic  may  be  routed  over  the  longer  path.)  Nondemand  splitting  basically  means 
that  the  total  demand  between  any  two  points  on  the  ring  travels  the  same  route 
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(shortest  path),  and  is  not  normally  split  between  two  different  routes  (i.e.,  split 
between  shortest  and  longest  paths).  BRs  are  engineered  based  on  the  span 
carrying  the  maximum  amount  of  traffic.  A  BR  may  use  four  fibers  or  two  fibers 
depending  upon  the  spare  capacity  arrangement,  as  shown  in  Figure  6.4. 


Figure  6.4:  Bidirectional  Rings 

In  a  four  fiber  BR  (or  1 : 1  configuration),  working  and  protection  channels  use 
separate  communications  paths  (fibers).  The  1 : 1  configuration  has  evolved  from 
today’s  point-to-point  protected  systems  and  can  provide  line  protection  switching 
using  similar  SONET  automatic  protection  switching  procedures. 

In  a  two  fiber  BR,  working  and  protection  channels  use  the  same  two  parallel 
communications  paths  (i.e.,  reserve  half  bandwidth  for  protection).  This  ring 
arrangement  can  provide  line  protection  switching  by  using  a  time  slot  interchange 
method  for  merging  working  channels  with  protection  channels. 

Unidirectional  and  bidirectional  rings  can  be  further  categorized  into  line  and  path 
protection  switched  rings  according  to  the  SONET  layer  used  to  trigger  the 
protection  switch  action  that  enables  the  ring  to  automatically  recover  from 
failures.  Bidirectional  path  switched  rings  are  not  discussed  here. 

6.2.2.3  Line  Protection  Switched  Ring  A  line  protection  switched  ring 
architecture  uses  SONET  line  layer  indications  to  trigger  the  protection  switching 
action  and  may  be  unidirectional  or  bidirectional.  Switching  action  is  performed  at 
only  the  line  level  to  recover  from  failures,  and  does  not  involve  path  level 
indications.  Line  level  indications  include  line  level  failure  conditions  and 
signaling  messages  that  are  sent  between  nodes  to  effect  a  coordinated  line 
protection  switch. 

An  example  of  a  unidirectional,  line  protection  switched  ring  (UR/L)  is  a  “folded,” 
UR  implementation,  that  folds  (or  loops)  the  disrupted  line  signal  (i.e.,  the  optical 
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line  signal  or  n  line  SPEs11)  onto  a  separate  protection  fiber  ring  (1:1  concept). 

This  example  is  shown  in  Figure  6.5.  The  nodes  adjacent  to  the  break  (nodes  A 
and  B  in  Figure  6.5)  perform  the  fold  or  looping  function.  Because  of  the  loopback 
capabilities  at  the  nodes  adjacent  to  the  break,  the  folded  ring  remains  in  a  ring 
topology. 


Figure  6.5:  Unidirectional  Ring  Implementations 

An  example  of  a  bidirectional,  line  protection  switched  ring  is  a  two  fiber  BR 
(BR/2F)  implementation,  and  is  shown  in  Figure  6.6.  In  this  implementation, 
protection  capacity  is  provided  around  the  ring  and  is  available  to  restore  disrupted 
traffic  during  protection  switch  conditions.  This  protection  capacity  (i.e.,  channels) 
is  shared  by  all  spans  (i.e.,  between  adjacent  nodes)  on  the  ring.  In  the  case  of  a 
fiber  cut,  the  nodes  adjacent  to  the  break  set  up  bidirectional  paths  (loopback 
functions  using  Time  Slot  Interchange  (TSI))  on  the  protection  channels  in  the 
opposite  direction  for  only  those  services  affected  by  the  break.  The  loopback 
function  is  for  looping  back  disrupted  traffic  between  nodes  A  and  B  from  working 
to  protection  channels.  As  a  result,  all  traffic  that  was  normally  on  the  working 
channels  on  the  link  between  nodes  A  and  B  is  now  on  the  protection  channels  in 
the  reverse  direction  around  the  ring.  The  protection  channels  can  then  be  reused 
by  other  spans  when  the  protection  channels  become  available  again. 


1 1  Synchronous  Payload  Envelopes. 
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Figure  6.6:  Bidirectional  Ring  Implementation 


A  four  fiber  BR  (BR/4F)  implementation  uses  a  line  loopback  recovery  technique, 
but  unlike  the  BR/2F  implementation,  working  and  protection  channels  are  routed 
on  separate  communications  paths  (Fibers).  Another  BR/4F  implementation  uses  a 
combination  of  line  loopbacks  and  automatic  protection  switching  span  protection. 
The  line  loopback  function  protects  against  cable  cuts,  and  the  span  protection 
safeguards  against  equipment  failures. 

6.2.2.4  Path  Protection  Switched  Ring  A  path  protection  switched  ring 
architecture  uses  SONET  path  layer  indications12  to  trigger  the  protection 
switching  action  and  may  be  unidirectional  or  bidirectional.  Switching  action  is 
performed  at  only  the  STS  or  VT  path  level  to  recover  from  failures,  and  does  not 
involve  line  level  indications.  Path  level  indications  include  path  level  failure 
conditions  and,  if  necessary,  signaling  messages  that  are  sent  between  nodes  to 
effect  a  coordinated  path  protection  switch.  Path  switching  of  a  specific  path  is 
independent  of  other  paths'  status. 

In  some  unidirectional,  path  protection  switched  rings  (UR/P,  see  Figure  6.5),  both 
directions  of  working  traffic  between  any  node  A  and  any  other  node  B  travel  on  a 
single  fiber  ring  in  the  same  direction  (e.g.,  clockwise),  but  arrive  at  their 
destinations  by  different  paths  (i.e.,  the  two  directions  of  traffic  use  the  opposite 
portions  of  the  single  ring).  A  second  protection  fiber  ring  carries  a  duplicate  copy 
of  the  information  signals  in  the  opposite  direction.  This  is  considered  a  “dual-fed” 
or  “1+1  protection”  approach  when  duplicate  information  is  transmitted  in  both 
directions  on  the  ring.  Thus,  at  a  receiving  node  (e.g.,  node  A),  two  signals  (one 
from  each  direction)  are  available  for  signal  selection.  In  the  case  of  a  break  in  the 
ring  (between  node  A  and  B),  receiving  node  A  switches  to  the  protection  path  for 


12  DS-l  or  DS-3  Alarm  Indication  Signals  (AISs)  arc  also  used  for  this  purpose. 
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path  restoration.  Receiving  node  B  was  not  affected  by  the  break,  so  no  path 
switching  is  necessary.  For  each  disrupted  path,  path  switching  requires 
identifying  and  selecting  the  valid  (or  protection)  path  for  path  restoration.  The 
information  for  the  switching  control  is  conveyed  within  the  path  signal  itself  (e.g.. 
Path  AIS).  It  should  be  noted  that,  unlike  common  approaches  to  ring 
implementations  which  maintain  their  ring  characteristics  following  a  fault,  this 
implementation  ceases  functioning  as  a  ring  if  a  break  in  the  ring  occurs,  i.e.,  it 
does  not  remain  in  a  ring  topology  (no  loopbacks)  following  a  break  in  the  ring  or 
the  loss  of  a  node.  However,  communications  among  the  nodes  is  maintained 
following  such  a  break. 

6.3  Logical  Layer  The  logical  layer  reestablishes  active  demands  that  route 
over  failed  network  links  or  nodes,  as  contrasted  with  a  circuit  switched  network  in 
the  service  layer,  where  existing  demands  are  lost  (and  must  redial).  This 
requirement  imposes  a  more  difficult  network  survivability  routing  task  on  the 
subnetworks  at  this  layer,  as  well  as  more  critical  restoration  intervals.  Restoration 
switching,  as  implemented  in  DCS  based  reconfiguration  strategies,  is  described 
below. 


6.3.1  DCS  Reconfiguration  Strategies  The  routing  capability  of  DCSs  can  be 
used  to  restore  service  in  case  of  network  failures.  By  changing  cross-connects, 
DCS  reconfiguration  methods  restore  service  by  routing  existing,  failed  demands 
(e.g.,  DS-1,  VT  1 .5,  or  STS-1  systems)  on  alternate  routes.  At  present,  DCS 
reconfiguration  methods  operate  on  a  network  of  either  wideband  DCSs  or  a 
network  of  broadband  DCSs.  However,  because  of  the  smaller  number  of  demands 
to  reroute,  most  DCS  reconfiguration  schemes  are  targeted  for  broadband  DCS 
networks. 

There  are  many  methods  being  developed  for  DCS  reconfiguration.  In  this  section 
we  generically  describe  the  steps  of  a  DCS  reconfiguration  scheme  as  well  as 
present  the  general  characteristics  of  the  different  approaches. 

DCS  repair  and  reconfiguration  processes  follow  the  basic  steps  identified  at  the 
start  of  Section  6  in  response  to  a  network  failure.  For  each  of  those  steps,  there 
are  a  variety  of  implementation  techniques.  However,  a  fundamental  characteristic 
that  applies  to  each  step  is  the  method  of  control:  under  centralized  control,  the 
step  is  principally  done  by  an  OSS  or  “controller,”  which  we  will  refer  to 
generically  as  a  “centralized  system,”  while  under  distributed  control,  the  step  is 
done  by  the  individual  DCSs.  Note  that  there  are  methods  that  mix  some  steps 
under  centralized  control  with  other  steps  under  distributed  control  (e.g., 
distributed  control  of  the  path  selection  step  with  centralized  control  of  the 
normalization  step). 
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Another  fundamental  characteristic  for  DCS  reconfiguration  is  the  method  for  data 
communications.  For  centralized  control  this  involves  communication  between  the 
DCSs  and  the  centralized  system(s).  For  distributed  control  this  involves 
communication  among  the  DCSs.  The  most  critical  aspect  of  data  communications 
for  DCS  reconfiguration  is  the  guarantee  of  a  secure  data  channel.  This  can  be 
described  succinctly  by  the  following  requirement:  for  a  given  demand,  if  there 
exists  a  surviving,  alternate  route,  then  data  communication  is  required  among  all 
DCSs  along  that  route  (to  enable  rerouting  of  the  demand).  Therefore,  in  general, 
communication  between  DCSs  and  the  centralized  system(s)  must  occur  over 
physically  redundant  routes.  For  distributed  control,  communication  between 
DCSs  can  use  SONET  overhead  (or  related  signal  overhead)  along  all  the  possible 
inter-DCS  “logical”  links.  This  aspect  guarantees  that  if  there  exists  a  surviving, 
alternate  route  between  two  DCSs,  then  communications  will  occur  along  this 
route  and,  thus,  the  requirement  is  satisfied. 

We  discuss  various  general  characteristics  of  DCS  reconfiguration  schemes  for 
each  step  below: 

6.3.1.1  Detection  and  Notification  Under  centralized  control,  DCSs  (or, 
possibly,  higher  level  network  elements)  detect  and  notify  the  centralized  system  of 
the  failure,  possibly  via  another  OSS  for  network  surveillance.  It  is  possible  that 
these  alarms  are  received  at  a  variety  of  multiplex  levels  and  the  centralized  system 
has  to  “locate”  and  analyze  the  fault,  i.e.,  deduce  which  channels  are  affected  (e.g., 
which  STS- Is  or  DS-ls  are  affected).  Under  distributed  control,  the  DCS  monitors 
each  of  its  channels  and  analyzes  the  faults;  although,  for  repair  purposes,  an  alarm 
may  still  be  sent  to  the  network  surveillance  OSS. 

63.1.2  Identification  Under  centralized  control,  the  centralized  system 
identifies  the  fault  and  instigates  the  reconfiguration  process.  It  may  have  various 
thresholds  to  screen  out  false  or  intermittent  alarms.  Under  distributed  control  this 
step  means  identifying  the  failure  to  the  DCSs  and  triggering  the  event.  When  a 
failure  is  identified  between  a  pair  of  DCSs,  this  implies  some  sort  of 
“handshaking”  or  prearranged  understanding  so  that  only  one  of  the  DCSs 
instigates  the  reconfiguration  process. 

6.3.1.3  Path  (Route)  Selection  .  Methods  for  implementing  this  step  can 
generally  be  described  by  the  three  following  characteristics: 

i)  Control :  centralized  or  distributed.  Centralized  control  means  that  a  centralized 
system  calculates  the  alternate  routes,  i.e.,  the  affected  DCSs  and  channel 
assignments.  Distributed  control  means  that  software  in  the  DCSs  calculates  the 
routes,  often  in  a  parallel  manner  among  all  the  DCSs.  It  is  possible  to  have 
combinations  of  both:  the  centralized  system  determines  higher  level  routes  or 
routing  guidelines  and  the  network  elements  choose  specific  routes.  For  example. 
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the  DCSs  can  use  routing  tables  that  are  populated  by  a  centralized  system,  an 
architecture  that  is  similar  to  that  used  in  switches  in  a  circuit  switched  network. 


ii)  Type  of  alternate  routing :  link  or  point-to-point13.  With  link  rerouting,  all 
demands  routed  over  the  failed  link  have  the  segment  of  their  route  corresponding 
to  the  failed  link  replaced  by  an  alternate  route.  Link  rerouting  is  illustrated  for  a 
wideband  DCS  subnetwork  with  DS-3  links  in  Figure  6.7.  Here  the  route  of  a 
sample  DS-1  demand  is  shown  by  a  dashed  line.  When  the  B-C  DS-3  link  fails, 
alternate  DS-1  routes  are  found  between  nodes  B  and  C.  The  “segment”  of  each 
DS-1  route  that  originally  routed  through  the  failed  link  is  cross-connected  to  the 
alternate  route,  sometimes  called  a  “patch”.  For  the  sample  DS-1,  the  alternate 
route  goes  through  nodes  B-A-D-C.  The  result  is  that  at  the  DS-1  level  there  is 
retracing  of  the  route.  With  point-to-point  rerouting,  constituent  demands  are  re¬ 
routed  from  end  node  to  end  node.  This  is  illustrated  in  Figure  6.8  where  the  DS-1 
is  routed  directly  over  the  A-D  DS-3  link. 


mmmmmm  old  path  of  DS-1  demand 
mm  —  new  path  of  DS-1  demand 
W-DCS  -  Wideband  DCS 


W-DCS 

A 


W-PCS 

D 


Figure  6.7:  Link  Rerouting  from  Link  B-C  to  B-A-D-C 

The  principal  advantages  of  link  rerouting  are  speed  and  ease  of  implementation 
because  the  DCS  needs  to  find  alternate  routes  between  only  one  node  pair,  e.g., 
nodes  B  and  C  in  Figure  6.7,  and  the  various  end  points  of  the  demands  that  route 
through  the  link  do  not  have  to  be  determined.  However,  it  is  complex  to  adapt 
some  link  rerouting  methods  for  multiple  link  failures  and,  especially,  node 
failures. 


13  Many  other  references  refer  to  point-to-point  rerouting  as  "path"  rerouting  (for  example,  see  [16]).  However, 
since  these  references  also  use  the  term  "alternate  path,"  even  for  link  rerouting  methods,  "point-to-point"  is  a  less 
ambiguous  term. 
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Figure  6.8:  Point-to-Point  Rerouting  of  A-D  DS-1  Demand  over  Spare 

Facilities 

Point-to-point  rerouting  requires  determination  of  individual  demands  and  their  end 
points  and  an  associated  route  selection  process  for  each  such  end  point  pair. 
Point-to-point  rerouting  schemes  tend  to  be  more  efficient  from  the  standpoint  of 
utilization  of  network  capacity,  i.e.,  they  require  less  network  spare  capacity  to 
achieve  the  same  level  of  network  survivability.  Another  advantage  is  that  some 
implementations  allow  the  capability  to  selectively  choose  which  demands  to 
reroute  in  case  of  failure.  Also,  because  demands  are  rerouted  between  their  end 
points,  point-to-point  rerouting  techniques  can  be  more  readily  adapted  to  handle 
node  failures.  However,  the  complexity  of  point-to-point  rerouting  techniques 
tends  to  make  them  slower  than  link  rerouting  techniques. 

iii)  Route  calculation  process:  precalculated  or  dynamic.  With  precalculated 
methods  the  alternate  routes  are  stored  in  a  database  (either  in  a  centralized  system 
or  locally  in  the  DCS)  and  retrieved  when  the  restoration  process  is  instigated.  If  a 
precalculated  method  stores  only  one  route,  then  for  each  demand  that  must  be 
rerouted,  a  single,  alternate  route  is  identified  and  reserved.  If  multiple  routes  are 
stored,  then  they  can  be  tried  in  sequence  until  a  functioning  route  (with  spare 
channels)  is  found.  Precalculated  methods  can  suffer  from  the  difficult  task  of 
satisfying  physical  diversity  requirements  when  preplanning  the  alternate  routes,  as 
well  as  database  integrity  issues.  Precalculated  schemes  also  suffer  from  the 
problem  of  keeping  the  data  current  under  a  changing  network.  The  main 
advantage  of  precalculated  schemes  is  simplicity  of  implementation  and  speed  of 
execution. 


Dynamic  routing  methods  use  current  network  status  knowledge,  i.e.,  link/node 
operational  status  (failed  or  working)  and  associated  link  spare  (also  called 
“occupancy”  or  “utilization”)  to  calculate  the  alternate  routes  at  the  time  of  failure. 
An  advantage  of  status  based  methods  is  that  failed  links  or  nodes  can  be  readily 
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identified  at  the  time  of  failure  and  avoided  in  the  selection  of  alternate  routes. 

This  simplifies  the  computational  burden  of  diversity  requirements  faced  by 
predetermined  methods.  However,  dynamic  routing  methods  are  more  complex  to 
implement  and  can  be  slower  to  execute. 

Most  methods  using  distributed  control  for  the  path  selection  step  implement 
dynamic  routing  via  various  adaptations  of  a  method  called  “flooding.”  In  this 
method  for  a  given  pair  of  DCSs  (say,  A  and  Z)  between  which  alternate  paths  are 
sought,  one  of  the  DCSs  instigates  the  restoration  process  (called  the  “sender”;  say 
DCS  A)  by  sending  out  requests  for  alternate  routes  to  each  of  its  neighboring 
DCSs,  which  in  turn  send  out  requests  to  each  of  their  neighboring  DCSs,  etc. 
Thus,  routes  are  pieced  together  a  link  at  a  time.  As  route  requests  reach  the 
destination  (DCS  Z),  confirmations  are  sent  back  along  the  route.  To  control  the 
process,  guarantee  a  good  selection  of  routes,  and  avoid  cycling,  the  DCSs  use 
information  such  as  link  counts  to  retain  efficient  routes  and  discard  the  others. 

63.1.4  Rerouting  Under  centralized  control,  demands  are  rerouted  when 
the  centralized  system  sends  specific  cross-connect  commands  to  the  DCS  over  the 
data  communication  network  (indicating  which  channels  to  cross-connect).  Under 
distributed  control,  the  DCS  instigates  its  own  cross-connects. 

6.3. 1.5  Normalization  This  is  one  of  the  most  difficult  aspects  of  DCS 

reconfiguration  methods.  The  major  difficulty  is  that  after  the  failure  is  repaired 
(e.g.,  a  fiber  cut  is  re-spliced  —  possibly  days  after  the  initial  failure),  it  is  difficult 
to  reroute  demand  back  onto  the  failed  link  without  incurring  further  interruption 
of  service.  In  fact,  if  not  carefully  executed,  the  normalization  step  could 
potentially  incur  a  longer  disruption  of  service  than  the  initial  failure  event.  Where 
adverse  impact  associated  with  normalization  is  unavoidable,  this  impact  can  be 
minimized  by  planning  and  coordination  with  the  users  of  the  services. 

There  are  some  arguments  to  leave  the  rerouted  demands  alone;  i.e.,  unchanged. 
However,  this  option  is  not  likely  to  gain  support  from  network  planners  because 
the  reconfiguration  process  tends  to  use  up  most  of  the  network  spare  capacity  and, 
in  addition,  uses  long  and  inefficient  alternate  routes.  If  the  demands  are  not 
rerouted  after  the  failed  link  is  repaired,  new  connections  would  also  tend  to  be 
routed  inefficiently  in  the  attempt  to  use  up  the  idle  spare  on  the  newly  repaired 
link.  Since  one  of  the  advantages  of  DCS  reconfiguration  is  its  capacity  efficiency, 
this  practice  would  obviate  this  advantage.  To  understand  the  potential  magnitude 
of  this,  consider  a  144  fiber  cable  that  carries  OC-48  systems.  A  full  cut  on  this 
cable  can  easily  involve  hundreds  of  active  STS-1  channels  (a  maximum  of  3,456 
STS-1  channels). 

63.1.6  Summary  of  DCS  Reconfiguration  Methods  Beyond 
implementation  issues,  the  various  methods  can  be  evaluated  in  terms  of  speed  of 
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reconfiguration  and  the  efficiency  of  path  selection.  In  terms  of  speed  of 
reconfiguration,  while  there  are  simulation  studies  of  the  times  of  the  path  selection 
step  for  various  methods,  at  present  there  are  few  observed  execution  times  in  real 
networks.  However,  a  generalization  for  broadband  DCS  reconfiguration  is  that 
methods  that  use  distributed  control  for  the  rerouting  step  have  restoration  time  on 
the  order  of  seconds  while  those  that  use  centralized  control  have  a  restoration  time 
on  the  order  of  minutes. 

Methods  that  use  centralized  control  for  all  steps  are  found  in  [  10]  and  [11]. 
Methods  that  use  distributed  control  for  all  steps  (except  normalization)  and 
dynamic  route  calculation  are  given  by  [12-16].  A  method  that  uses  distributed 
control  for  the  detection,  signaling  and  identification  steps,  centralized  control  for 
the  path  selection  step  (precalculated  routes),  and  distributed  control  for  the 
rerouting  steps  is  given  in  [17]. 

6.4  Service  Layer  The  service  layer  provides  public  and  private  data  and 
voice  services.  The  survivability  techniques  of  this  layer  mitigate  the  impact  of 
uncovered  failures  at  the  logical,  system,  and  physical  layers  as  well  as  failures  and 
degradation  at  the  service  layer  itself  (e.g.,  switch  failures). 

6.4.1  Circuit  Switching 

6.4.1. 1  Size  Limits  A  preventative  technique  to  mitigate  the  severity  of 
service  outages  is  to  limit  the  size  or  capacity  of  a  part  of  a  circuit  switched 
network  to  ensure  that  its  failure  will  not  overly  diminish  the  overall  grades  of 
service.  Limiting  capacity  loss  typically  involves  limiting  the  number  of  lines 
carried  by  (and  thus  user  dependence  on)  any  one  switch  or  transmission  system. 

6.4.1.2  Dynamic  Routing  Dynamic  Routing  is  the  evolutionary  result  of 
traditional  switched-network  hierarchical-routing  structures.  Using  dynamic 
routing,  a  service  network  can  reroute  workloads  around  failures  in  the  network 
without  requiring  changes  in  the  underlying  logical  network.  It  is  most 
appropriately  based  upon  service  demands,  and  is  an  enhancement  of  simple 
alternate  routing.  A  key  attribute  of  dynamic  routing  techniques  is  that  the  path 
choice  is  determined  at  the  time  of  the  call  attempt.  The  choice  is  based  upon  a 
relatively  short  network  status  update  interval  calculation  which  determines 
networking  path  strategy  using  link  occupancy,  optimum  route  sets  and  changing 
network  conditions.  In  the  event  of  a  service  outage,  dynamic  routing  reroutes  new 
service  requests  around  the  affected  area.  The  established  service  connections  in 
the  affected  area  are  disrupted  and  not  rerouted. 

Dynamic  routing  has  the  potential  of  providing  the  survivability  advantages  of  link 
failure  and  tandem  node  failure  protection  in  addition  to  network  routing  efficiency 
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benefits.  Descriptions  of  various  dynamic  routing  methods  for  circuit  switched 
networks  can  be  found  in  [29-32]. 

6.4.1.3  Reconfiguration  Reconfiguring  the  service  network  includes 
options  such  as  dual-homing  and  internal  rearrangements  in  private  networks. 

After  a  failure,  dual-homing  reconfigures  a  service  network  by  redefining  the  home 
switch  for  both  incoming  and  outgoing  traffic  destined  to  the  affected  service  area. 

6.4.1.4  Network  Management  When  two  or  more  network  elements  in  an 
exchange  carrier  and/or  interchange  carrier  network  are  experiencing  degraded 
service,  affecting  performance  for  a  set  threshold  time,  manual  network 
management  conn  ol  methods  such  as  call  gapping  (where  a  limited  number  of  calls 
is  allowed  in  a  given  time  interval),  CANcel  To  (CANT,  which  disallows  calls  to  a 
designated  switch),  CANcel  From  (CANF,  which  disallows  calls  from  a  designated 
switch),  SKIP  (which  causes  affected  traffic  to  skip  the  controlled  trunk  group)  and 
reroute  can  be  used.  See  [49]. 

6.4.1.5  Multi-Serving  Techniques  that  enhance  survivability  at  the  Service 
Layer  include: 

•  wireless  technology, 

•  dual  homing  (two  or  more  serving  central  offices  for  a  given  serving  area)  or 
split  homing  (providing  service  from  two  or  more  nodes) 

•  alternate  access  to  other  ICs,  and 

•  EC  bypass. 

6.4.2  Packet  Switching  Packet  networks  have  their  own  layered  structure  for 
data  communications  within  the  seven  layer  Open  Systems  Interconnection  (OSI) 
standard  of  the  International  Standards  Organization  (ISO)  [25].  Generally,  in  this 
structure,  communication  at  higher  OSI  layers  assumes  secure  communication  at 
lower  OSI  layers.  Additional  data  communication  protocols  such  as  Frame  Relay 
or  other  non-OSI  standard  implementations  may  use  the  OSI  or  other  layering 
structures. 

Techniques  for  survivability  of  packet  networks  at  the  service  layer  include  some 
of  the  following  actions  within  the  transmission  protocols  and  routing  schemes: 

•  retransmission  of  packets  or  data  frames  that  are  lost  or  corrupted, 

•  dynamic  routing  of  packets  for  routing  around  links  and  nodes  that  are  failed 
or  heavily  congested, 

•  priority  routing  of  messages,  and 
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•  congestion  control ;  for  fixed  routing  schemes  —  rerouting  of  traffic  around 
failures  or  heavy  congestion,  queuing  of  packets  or  rejection  of  packets 
between  certain  point-to-points. 

6 .43  Common  Channel  Signaling  CCSNs  (Common  Channel  Signaling 
Networks)  are  dedicated  signaling  networks  which  transport  both  circuit  associated 
(call  setup)  and  non-circuit  associated  (transaction  capabilities)  signaling  messages, 
between  exchanges  or  other  signaling  nodes,  separately  from  the  voice/data  path, 
using  the  SS7  (Signaling  System  Number  7)  protocol14.  CCSNs  support  both 
Integrated  Services  Digital  Network  (ISDN)  and  non-ISDN  or  POTS  calls  in  a 
circuit  switched  network.  The  SS7  protocol  consists  of  the  following  five  parts: 
the  Message  Transfer  Part  (MTP),  the  Signaling  Connection  Control  Part  (SCCP), 
the  ISDN  User  Part  (ISDN-UP  or  ISUP),  the  Transaction  Capabilities  Application 
Part  (TCAP),  and  the  Operations,  Maintenance  and  Administration  Part  (OMAP). 
The  MTP  and  SCCP  provide  the  basic  transport  messages  between  nodes  in  the 
CCSN,  management  of  mated  pairs  of  Signaling  Transfer  Points  (STPs),  duplicate 
Service  Control  Points  (SCPs)  and  databases,  and  load  balancing  within  and  among 
link  sets.  The  ISDN-UP  provides  call  handling  signaling  functions,  such  as  call 
setup  and  release,  for  non-ISDN  as  well  as  for  ISDN  voice  and/or  data  calls.  The 
TCAP  supports  non-circuit  related  activities,  such  as  establishing  forwarding 
arrangements  for  an  800  Service  call.  The  OMAP  provides  operations  and 
management  of  the  logical  and  physical  resources. 

Survivability  techniques  are  needed  for  each  of  the  five  parts  of  the  SS7  protocol. 
Some  survivability  techniques  for  the  MTP  and  the  SCCP  are  described  here. 
Software  diversity,  which  can  be  used  as  a  survivability  technique  for  any  part  is 
also  described.  The  TCAP  relies  mainly  on  SCCP  for  survivability.  Some 
survivability  techniques  for  circuit-switching  networks  using  the  ISDN-UP  for  call 
control  are  covered  in  Section  6.4.1. 

6.4.3.1  Service  Layer  Architecture  for  MTP  CCSNs  as  designed  today 
include  a  high  level  of  network  element  redundancy,  intended  to  avoid  interruption 
of  service  for  the  users  of  the  traffic  networks  which  rely  on  CCSNs  for  signaling. 
Common  Channel  Signaling  (CCS)  network  element  arrangements  are  used  as 
service  layer  survivability  techniques.  The  effectiveness  of  these  arrangements 
depends  on: 

•  proper  engineering  of  CCS  network  elements  to  satisfy  the  desired 
performance  objectives,  and 


14  The  SS7  protocol,  a  variant  of  CCITT  SS  No.  7  |41 ),  is  modeled  after  the  Open  System  Interconnection  (OSI) 
seven-layer  reference  model  (25],  although  it  docs  not  strictly  adhere  to  it.  Figure  5  of  [41]  illustrates  this 
relationship. 
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•  support  of  CCS  network  element  redundancy  and  arrangements  by  the 
interoffice  facilities  (physical,  system  and  logical  layers)  network,  which 
should  have  an  appropriate  level  of  diversity  of  equipment  sites  and  facilities 
(see  sections  6.1, 6.2, 6.3,  and  Table  5.1). 

There  are  two  categories  of  signaling  nodes,  or  Signaling  Points  (SPs): 

a.  Signaling  End  Points  (SEPs),  consisting  of: 

•  CCS  Switching  Offices  (CCSSOs),  which  have  the  capabilities  to  provide 
call  control  on  the  CCSN  and  may  also  interact  with  centralized,  on-line 
databases  to  support  query-oriented  services  such  as  800  Service  and 
Alternate  Billing  Service  (ABS).  CCSSOs  that  have  both  call  control  and 
database  access  capabilities  are  called  Service  Switching  Points  (SSPs). 

•  SCPs,  which  provide  access  to  databases  and  query  processing  systems  for 
call  control  and  customer-related  information,  and  decision-making  for 
intelligent  services.  SCPs  may  be  duplicated,  or  may  operate  as  stand¬ 
alone  units. 

b.  STPs,  which  are  high-speed,  high-reliability,  large-capacity  packet  switches 
for  CCS  message  routing  and  transport.  The  American  National  Standard 
(ANS)-based  reference  architecture  (ANSI  Tl.l  1 1.5-1992)  [43]  has  STPs 
deployed  in  mated  pairs.  Figure  6.9  shows  a  typical  STP  hierarchy  that 
consists  of: 

•  Regional  (remote)  STPs,  or  RSTPs,  which  are  directly  connected  to 
regional  SCPs. 

•  Local  (home)  STPs,  or  LSTPs,  which  route  traffic  between  switching 
offices,  and  forward  traffic  to  RSTPs  for  routing  to  an  SCP. 

•  Gateway  STPs,  either  regional  or  local  STP  pairs,  which  interconnect 
different  carriers'  CCSNs  and  route  signaling  traffic  between  them. 

Note  that  an  STP  can  serve  both  RSTP  and  LSTP  functions.  For  simplicity, 
not  all  possible  connections  are  shown  in  Figure  6.9. 

According  to  ANSI  Tl.l  1 1.5-1992  [43],  there  are  six  categories  of  CCS  link  sets, 
Access  (A),  Bridge  (B),  Cross  (C),  Diagonal  (D),  Extended  (E)  and  Fully- 
associated  (F): 

1.  A-link  sets  connect  SEPs  (SSPs,  CCSSOs  and  SCPs)  to  home  STP  pairs. 

2.  B-link  sets  connect  STP  pairs  of  the  same  hierarchical  level  within  a 
network  (e.g.,  local  STP  pair  to  local  STP  pair). 
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Figure  6.9:  Relationship  Between  a  CCSN  and  Other  Traffic  Networks 

3.  C-link  sets  connect  the  two  mates  in  a  mated  STP  pair.  Under  normal  (no 
failure)  conditions,  they  carry  only  network  management  messages. 

4.  D-link  sets  connect  STP  pairs  of  different  hierarchical  levels  within  a 
network  (e.g.,  local  STP  pair  to  regional  STP  pair)  or  interconnect  STP 
pairs  of  one  network  to  STP  pairs  of  a  different  network^  (e.g.,  EC  STP 
pair  to  IC  STP  pair). 

5.  E-link  sets  directly  connect  SEPs  with  remote  STP  pairs. 

6.  F-link  sets  directly  connect  SEPs  which  have  a  high  community  of  interest. 

A-link  sets  are  deployed  in  pairs  (such  a  pair  is  called  a  combined  link  set)  from  an 
SEP  to  its  home  mated  STP  pair.  E-link  sets  are  deployed  in  pairs  (such  a  pair  is 
called  a  combined  link  set)  from  an  SEP  to  one  or  more  remote  mated  STP  pairs. 
B-  and  D-link  sets  are  deployed  in  quads  (one  combined  link  set  from  each  STP) 
between  two  sets  of  mated  STP  pairs,  as  per  the  ANS-based  reference  architecture. 


15  This  terminology  is  based  on  agreements  by  T1S1.3  at  its  November  1992  meeting  150]. 


35 


Technical  Report  No.  24 

In  such  cases,  the  CCSN  traffic  load  is  balanced  among  the  available  (non-failed) 
links  within  each  of  the  A-,  B-,  D-,  E-  and  F-link  sets.  Upon  STP  or  link  set 
failure,  the  entire  load  is  routed  onto  the  A-,  B-,  D-,  E-  and  F-link  sets  which 
remain  connected,  or  the  C-link  sets.  In  order  for  service  layer  diversity  to  be 
supported  at  the  physical  layer,  (i)  mated  STPs  and  duplicated  SCPs  should  be 
deployed  at  physically  diverse  equipment  sites  and  (ii)  paths  using  physically 
diverse  equipment  sites  and  facility  routes  are  needed  between  the  link  sets  in  a 
combined  A-link  set,  between  the  link  sets  in  a  combined  E-link  set,  and  within  B- 
and  D-link  set  quads. 

Some  networks  may  not  comply  with  the  ANS-based  reference  architecture  (e.g., 
STPs  in  stand-alone,  unmated  configurations).  Performance  implications  for  such 
configurations  are  discussed  in  Section  7.4. 1.3. 

The  basic  mesh  network  segments  used  in  CCSNs  are  [22, 43]: 

•  user  interface  segments,  consisting  of  the  SEP  equipment  which  contains 
shared  CCS  message  handling  capabilities. 

•  network  access  segments,  consisting  of  the  SEP  interfaces16  with  the  A-link 
sets,  the  A-link  set  transport  facilities,  the  STP  interfaces  with  the  A-link 
sets,  and  the  portion  of  the  STPs  which  contain  shared  CCS  message 
handling  capabilities. 

•  backbone  network  segments,  consisting  of  the  STP  interfaces  with  the  B-/D- 
and  C-link  sets,  and  the  B-/D-  and  C-link  set  transport  facilities. 

Figure  6.10  illustrates  a  typical  MTP  arrangement  of  CCS  basic  mesh  network 
segments.  Given  that  sufficient  link  capacity  has  been  provisioned  so  that  all 
surviving  links  can  carry  the  offered  loads,  such  arrangements  provide  service 
layer  survivability  and  limit  service  disruption  due  to:  (i)  any  single  failure  in  the 
network  access  segments  or  backbone  network  segments,  and  (ii)  any  double 
failure  in  the  backbone  network  segments17. 

Some  Survivability  Techniques  in  the  MTP  layer  include: 

•  dynamic  rerouting  after  link  and  STP  failures, 

•  congestion  control  mechanisms, 

•  graceful  MTP  restart  procedures,  and 

•  cluster  routing  procedures. 


16  Interfaces  arc  those  CCS  message  handling  capabilities  dedicated  for  one  link. 

17  This  implies  that  the  four  link  sets  composing  a  quad  arc  provisioned  with  three-way  diversity. 
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The  MTP  layer  dynamically  changes  message  routing  to  account  for  the  changing 
availability  of  links  and  STPs.  Congestion  control  procedures  reduce  traffic  loads 
to  links  and  STPs  by  throttling  message  generation  at  the  sources.  MTP  restart 
procedures  allow  an  STP  to  become  available  at  the  MTP  layer  without  becoming 
overloaded  with  service  and  network  management  messages  and  processing. 
Cluster  routing  reduces  the  amount  of  routing  information  required  in  an  STP,  the 
amount  of  network  management  processing,  and  network  management  message 
transmission  required  if  routes  change  their  status  (e.g.,  available  to  unavailable). 
Cluster  routing  may  also  benefit  SEPs. 


Figure  6.10:  CCS  Basic  Mesh  Network  Segments 

6.4.3.2  Service  Layer  Architecture  for  SCCP  The  SCCP  provides 
additional  network  type  functions  to  the  MTP  procedures.  SCCP  supports 
connectionless  and  connection-oriented  services  that  are  required  for  data  transport 
between  points.  Connectionless  service  refers  to  data  transport  through  the 
network  without  the  set-up  of  a  logical  signaling  connection,  (datagram  service). 
Connection-oriented  service  refers  to  data  transport  through  the  network  using 
logical  signaling  connections  (virtual  circuits).  The  signaling  links,  STPs  and 
SEPs  may  be  combined  in  many  different  ways  to  form  a  “signaling  network.” 
Figure  6.1 1  gives  an  example  of  how  the  MTP  signaling  network  of  Figure  6.10 
can  be  configured  for  SCCP. 
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Figure  6.11:  Example  SCCP  Signaling  Network  Structure 

In  the  above  figure  an  SCCP  signaling  relation  exists  between  each  SEP  and  the 
pair  of  STPs.  The  STP  pair  performs  SCCP  relay  functions  (e.g.,  Global  Title 
Translation,  or  GTT). 

Survivability  Techniques  in  the  SCCP  layer  include: 

•  backing  up  GTT  to  another  STP  pair, 

•  duplicating  SCPs, 

•  load  balancing  between  two  SCPs, 

•  switching  to  backup  SCPs,  and 

•  using  multiple  MTP  routes. 

SCCP  redundancy  through  multiple  signaling  relations  between  two  signaling 
points  can  be  achieved  by  equipping  several  STP  pairs  for  GTTs.  Load  balancing 
between  two  SCPs  spreads  the  impact  of  any  traffic  surges.  When  a  data  base  fails 
in  an  SCP,  SCCP  can  switch  to  use  backup  data  bases  in  other  SCPs.  The  impact 
of  any  changes  to  the  MTP  signaling  network  structure  is  diminished  through  the 
use  of  multiple  MTP  routes  for  some  SCCP  signaling  relations. 

6.4.33  Manual  Traffic  Management  Controls  In  situations  where  the 
SS7  protocol  may  not  provide  protection  against  some  failures  or  implementation 
error  conditions,  several  manual  traffic  management  controls  are  available  and 
others  have  been  proposed  [20].  These  manual  network  traffic  management 
controls  provide  or  will  provide  capabilities  to  render  link  sets  unavailable  to  all 
message  traffic,  and  to  selectively  block  message  traffic  based  on  a  number  of 
selected  MTP  and  SCCP  attributes. 

For  example,  one  control  would  allow  a  network  operator  to  cause  links  to  become 
unavailable.  This  control  could  be  used  to  isolate  a  signaling  point  from  other 
signaling  points  or  isolate  one  network  from  another. 

6.43.4  Relationship  of  Software  Diversity  to  CCSN  Reliability  With  the 
increasing  amount  of  software  deployed  in  the  CCSNs  and  its  increasing 
complexity,  software  and  its  failure  effects  on  the  CCSN  have  become  a  major 
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concern.  Because  software  engineering  is  a  process  susceptible  to  human  error,  no 
program  can  be  guaranteed  to  perform  as  expected.  The  software  “bug” 
contributing  to  the  1991  CCSN  outages,  which  affected  a  large  number  of  customer 
lines,  has  underscored  the  vulnerability  of  CCSNs  to  software  failures. 

Potential  alternatives  related  to  CCSN  reliability  are  included  in  the  following 
paragraphs.  These  potential  alternatives  are  included  here  because  they  are  being 
discussed  in  the  industry;  however,  the  validity  of  these  and  other  potential 
alternatives  is  not  known  and  requires  further  investigation.  In  addition,  some  of 
these  potential  alternatives  may  not  be  viable. 

Four  alternatives  are  identified  here:  i)  multiple  software  developments  in  STPs,  ii) 
different  software  generics  for  backup,  iii)  mixed-supplier  STP  pairs,  and  iv)  E- 
links  to  different  suppliers’  STPs.  This  Technical  Report  considers  mixed-supplier 
STP  pairs  to  be  the  least  viable  of  these  alternatives. 

The  advantages  and  disadvantages  (or  concerns)  of  these  alternatives  to  ensure 
software  diversity  should  he  weighed  by  individual  telecommunications  network 
providers.  For  example,  the  prevalence  of  correlated  software  failure  modes,  the 
cost  of  deploying  these  alternatives,  as  well  as  the  service  and  revenue  lost  by 
correlated  software  failures  should  be  carefully  considered  and  analyzed.  These 
and  the  other  concemf  expressed  below  are  not  exhaustive  listings,  but  rather 
catalysts  for  further  discussion  and  standards  contributions. 

i)  Multiple  software  developments  in  STPs :  Software  generics  meeting  the  same 
requirements  could  be  developed  by  two  different  teams  from  a  single  supplier 
(i.e.,  not  identical  copies).  These  two  developments  would  reside  on  the  two  STPs 
of  a  mated  pair  and  would  help  ensure  the  software  failure  mode  independence 
between  the  two  STPs  in  the  mated  pair. 

Following  are  some  of  the  concerns  related  to  this  alternative: 

•  The  additional  resources  required  for  multiple  software  development  and  the 
associated  costs  concern  both  suppliers  and  network  providers.  These 
additional  costs  would  probably  be  reflected  in  the  prices  network  providers 
pay  for  these  products,  and  would  also  continue  for  each  new  generic. 

•  Although  the  processes  for  multiple  software  developments  may  be  separate, 
many  faults  may  result  from  the  fact  that  the  requirements/specification 
stage  is  usually  the  same. 

•  Uniformity  in  features,  capabilities,  user  interfaces,  etc.  between  the  two 
software  developments  is  extremely  important.  In  other  words,  multiple 
developments  of  the  software  have  to  be  totally  transparent  to  the  users. 
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ii)  Different  software  generics  for  backup :  The  same  software  could  be  used  in 
both  STPs  as  die  current  implementation,  and  a  different  software  generic  (e.g.,  an 
older  version)  could  be  used  for  backup.  For  example,  when  certain  thresholds  are 
exceeded  or  phenomena  observed,  an  older  generic  would  be  loaded  into  the  STPs. 

The  following  are  some  of  the  concerns  with  this  alternative: 

•  Based  on  the  current  understanding  of  the  behavior  of  the  CCS  network,  the 
measurements  to  be  used  and  the  phenomena  to  indicate  major  problems  can 
not  be  clearly  defined.  It  is  therefore  extremely  difficult  to  decide  what 
conditions  should  exist  before  loading  the  STPs  with  a  backup  generic. 

•  The  procedures  to  reboot  and  reload  the  STPs  under  the  failing  conditions 
and  the  problems  which  may  occur  in  transition  need  to  be  carefully 
evaluated.  Without  detailed  procedures  to  ensure  smooth  transitions,  many 
problems  may  occur. 

•  Incompatibilities  may  exist  between  newer  and  older  generics  and  less 
capabilities  may  be  offered  by  the  previous  generics. 

iii)  Mixed- supplier  STP  pairs:  It  has  been  suggested  that  mixed-supplier  STPs  in 
a  mated  pair  configuration  may  protect  against  failure  conditions  propagating 
between  the  mates,  because  the  failure  modes  of  the  two  mates  may  be  independent 
for  both  hardware  and  software. 

The  following  are  some  of  the  problems  to  be  resolved  with  this  alternative: 

•  The  lower-capacity  STP  in  the  mated  pair  becomes  the  limiting  factor. 
Furthermore,  the  CCSN  supported  by  the  mixed-supplier  STP  pair  would 
suffer  the  shortcomings  (capacity,  recovery,  etc.)  of  both  pairs. 

•  If  all  interoperability  issues  are  not  resolved  and  thoroughly  tested,  and 
deployment  is  not  well  coordinated,  introducing  a  different  supplier’s  STP  to 
an  existing  mated  STP  pair  would  risk  end-office  isolations  or  mate  STP 
outages. 

•  OA&M  and  Provisioning: 

-  the  OA&M  and  provisioning  functions  of  each  mate  STP  would  have 
differing  requirements  that  would  create  problems  within  the  respective 
support  areas.  These  problems  include:  link  growth  and  activation, 
making  routing  data  changes  and,  in  particular,  systems  support  during 
critical  network  outages. 

-  If  all  of  the  activities  and  differences  between  each  STP  in  the  mated  pair 
are  not  well  understood,  then  additional  and  exceptional  (e.g., 
interoperability)  training  of  operations  personnel  is  required.  Lack  of 
such  training  would  jeopardize  the  survivability  of  the  CCSN. 

-  Additional,  separate  data  administration  may  be  required,  and  any 
inconsistencies  between  mate  STPs  would  cause  confusion. 
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-  Troubleshooting  mixed-supplier  STP  mated  pair  failures  would  be 
extremely  difficult  with  the  two  suppliers'  technical  support  organizations. 

-  System  initialization  procedures  may  have  designed  capabilities  that 
facilitate  operation  when  the  mate  STP  is  from  the  same  supplier. 

-  There  may  be  feature  differences  between  mixed-supplier  STP  mated  pairs 
which  would  cause  inconsistencies.  Timing  of  feature  introduction  would 
also  be  different  with  the  two  suppliers. 

•  Comprehensive  conformance  testing  needs  to  be  undertaken  (e.g.,  to  ensure 
that  the  STPs  do  not  send  or  depend  on  proprietary  C-link  messages  during 
recovery). 

iv)  E-links  to  different  suppliers'  STPs:  In  addition  to  the  two  A-link  sets  to  the 
home  STP  pair,  E-links  could  be  deployed  to  a  remote  STP  pair  of  a  different 
supplier  [28]  (see  Figure  6.12).  This  architecture  provides  alternate  access  for 
SEPs  to  the  rest  of  the  CCSN  and  also  helps  ensure  software  and  hardware  failure 
mode  independence.  This  architecture  will  improve  the  availability  of  the  network 
access  segment  as  well  as  bypass  problems  that  occur  in  the  home  STP  pair  or  the 
quad  links.  Various  levels  of  diversity  implemented  in  E-links  (e.g.,  diversity 
between  E-  and  A-links)  may  lead  to  different  levels  of  reliability  improvement. 
The  benefits  of  E-links,  which  increase  the  diversity  and  redundancy  of  the  CCSN, 
do  not  depend  on  the  provisioning  of  different  supplier  STP  pairs  in  the  network. 

The  following  are  some  of  the  concerns  with  this  alternative: 

•  In  the  case  where  the  E-links  are  deployed  to  a  pair  of  STPs  from  a  different 
supplier,  all  of  the  functions  and  features  available  at  the  home  STP  pair  may 
not  be  available  at  the  remote  STP  pair.  This  may  cause  capabilities  which 
might  otherwise  be  offered  on  one  STP  pair  to  be  delayed  until  the  supplier 
of  the  other  STP  pair  develops  the  software  generics  to  support  those 
capabilities. 

•  If  E-links  are  deployed,  more  routing  options  exist  and  therefore,  routing 
tables  will  be  more  complex  in  both  switches  and  STPs.  Additional  effort 
may  be  needed  to  administer  and  maintain  the  routing  table  databases. 
Furthermore,  much  more  memory  will  be  needed  at  the  switches  and  STPs 
to  support  the  additional  routing  and  translation  table  information. 

•  The  remote  STP  pair  would  have  to  handle  the  entire  traffic  load  if  the  home 
STP  pair  becomes  inaccessible.  As  a  result,  additional  throughput  and  link 
ports  are  required  at  the  remote  STP  pairs  to  handle  the  traffic  and  links  from 
the  SEPs  when  E-links  are  deployed.  This  may  force  rehoming  and 
reengineering  the  network  more  frequently. 
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Figure  6.12:  E-Link  sets  to  a  Remote  STP  Pair  of  a  Different  Supplier 

The  above  four  alternatives,  in  addition  to  others,  should  be  considered  in  future 
discussion  and  standards  contributions  addressing  software  diversity  and  CCSN 
reliability  improvement.  These  alternatives  are  intended  to  ensure  software  failure 
mode  independence  of  the  STPs  so  that  correlated  software  failures  will  not  result 
in  a  major  network  outage. 

6.5  Integrated  Techniques  Some  integrated  techniques  include  networks  that 
have  combinations  of  DCSs,  automatic  protection  switching  and  self-healing  rings. 
A  few  of  these  are  summarized  below: 

•  A  subnetwork  combining  DCS  restoration  and  point-to-point  systems  using 
APS  with  diverse  routing,  where  the  spare  bandwidth  for  the  DCS  rerouting 
method  can  access  the  spare  paths  of  the  APS  systems  that  are  reserved  for 
their  protection  channels  (e.g..  Figure  6.13).  Among  other  methods,  this  can 
be  accomplished  by  having  direct  optical  interfaces  into  the  DCSs  for  the 
APS  systems. 

•  Combinations  of  many  intersecting  (including  overlapping)  self-healing 
rings  covering  the  network: 

-  traffic  between  rings  is  cross-connected  via  the  low-speed  ports  of  DCSs 
and  Add-Drop  Multiplexers  (ADMs,  see  Figure  6.14),  (A  device  which 
processes  signals  for  combination  and  decombination  purposes)  and 

-  the  self-healing  rings  intersect  via  DCSs  with  integrated  high-speed  or 
fiber  interfaces  (see  Figure  6.15).  The  DCS  plays  the  role  of  ADM  for  all 
the  intersecting  rings  at  that  office,  as  well  as  cross-connecting  traffic 
between  rings. 

•  Self-healing  rings  lie  on  the  perimeter  of  the  network  and  intersect  with  the 
core,  mesh  subnetwork  of  DCSs  (see  Figure  6.16).  Failures  within  the  mesh 
are  protected  by  DCS  restoration  methods.  The  rings  may  either  intersect 
the  DCSs  with  integrated  high-speed  or  fiber  interfaces  or  the  traffic  among 
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7.  Network  Survivability  Performance  Analysis 

7.1  Network  Survivability  Characterization  For  the  purpose  of  a  network 
survivability  performance  study,  the  general  framework  defined  in  Section  5  can  be 
used  and  the  network  can  be  partitioned  into  the  following  four  network 
survivability  layers:  service,  logical,  system  and  physical.  For  each  layer,  a  set  of 
measurements  can  be  defined  and  their  characteristics  can  be  modeled. 

There  are  many  ways  to  characterize  network  survivability  and  define  survivability 
measurements.  Using  survivability  measurements,  one  can  formulate  models  and 
then  define  relevant  measures  such  as  traffic  survivability  ratio,  service 
connectivity  ratio,  average  network  downtime,  etc.,  that  can  be  used  to  estimate  the 
survivability  performance  of  a  real  or  hypothetical  network.  Table  7.1  lists  some 
network  survivability  measurements  for  the  four  network  survivability  layers. 


mm 

Examples  of  Survivability  Measurements 

Service 

end-to-end  grade  of  service,  number  of  calls,  number  of  connected 
subscribers,  traffic  volume,  carried  load,  packet  throughput,  utilization 

Logical 

number  of  surviving  lower-rate  transmission  channels  (DS-ls,  or 

DS-3s),  connectivity 

EflSH 

number  of  surviving  transmission  systems 

building  integrity,  connectivity  and  number  of  surviving  cables 

Table  7.1:  Examples  of  Network  Survivability  Measurements 

7.2  Network  Survivability  Analysis  Models  There  are  two  basic  approaches 
to  survivability  analysis  defined  here.  The  first,  the  Given  Occurrence  of  Failure 
(GOF)  survivability  analysis  model,  uses  a  conditional  approach  and  defines 
survivability  measures  for  a  network  assuming  that  given  failure(s)  have  occurred. 
This  GOF  model  may  either  use  probabilistic  weighting  of  the  resulting  states  of 
the  network  and  resulting  network  restoration  and/or  repair  after  the  failure  or  it 
may  use  deterministic  analysis  of  these  states.  Both  approaches  can  be  used  to 
evaluate  different  restoration,  repair  or  preventative  methods  depending  on  which 
type  of  comparison  characteristics  are  critical. 

The  second  survivability  analysis  model,  the  Random  Occurrence  of  Failure  (ROF) 
model,  uses  probability  of  network  failure(s)  and,  possibly,  rates  of  repair  and/or 
restoration  to  calculate  various  probabilistic  measures  of  network  unservability  or 
loss  (e.g.,  the  expected  amount  of  time  a  network  is  unservable). 

7.2.1  Given  Occurrence  of  Failure  Survivability  Model  In  the  GOF 
survivability  model,  it  is  assumed  disaster  has  already  occurred  and  a  given 
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element  (or  elements)  of  the  network  has  already  failed.  Hence,  from  a  modeling 
point  of  view,  the  occurrence  of  a  given  failure  is  assumed.  Users  often  phrase 
their  survivability  requirements  in  terms  of  which  types  of  failure  they  want  their 
traffic  protected  from,  and  what  proportion  of  the  traffic  should  survive.  The 
general  procedure  for  evaluating  GOF  measures  is  as  follows  (see  also  [26]): 

1.  define  a  survivability  measurement  (see  Table  7.1), 

2.  identify  the  sample  space  (i.e.,  the  failures  that  can  occur), 

3.  choose  the  failures  of  interest,  and 

4.  calculate  the  network  survivability  measures. 

7.2.2  Random  Occurrence  of  Failure  Survivability  Model  The  ROF 
survivability  model  is  a  general  form  of  the  well  known  availability  model.  This 
model  is  based  on  the  assumption  that  failures  can  be  characterized  by  random 
variables  with  given  probability  distribution  functions.  The  general  procedure  for 
evaluating  availability  based  network  survivability  measures  is  as  follows: 

1.  obtain  observed  rates  of  failure  and  repair/restoration, 

2.  define  network  survivability  measurements  of  interest, 

3.  identify  the  network  state  space  (i.e.,  the  various  states  in  which  a  network 
can  reside  concerning  whether  its  components  are  working  or  failed), 

4.  determine  the  survivability  measurement  for  each  network  state, 

5.  determine  or  assign  the  transitional  probability  from  each  state  to  another, 
and 

6.  calculate  the  network  survivability  measures  (e.g.,  the  expected  units  lost  or 
unservable  over  time). 

Note:  Step  6  is  often  a  very  complex  task  and  sometimes  impossible  to  evaluate 
exactly  in  networks. 

7.23  Application  of  GOF  and  ROF  Models  The  above  two  survivability 
models  can  be  applied  to  develop  performance  measures  for  all  four  survivability 
layers.  For  example,  the  fraction  of  buildings  remaining  after  a  hurricane  of  a 
specified  strength  and  extent  is  a  survivability  attribute  for  the  physical  layer.  The 
measures  used  in  system  and  logical  layers  are  usually  based  on  a  static  traffic 
model,  i.e.,  the  traffic  is  assumed  to  change  over  a  long  interval  of  time,  an 
assumption  that  may  be  inappropriate'  for  higher  network  layers  (e.g.,  a  service 
layer  network). 

In  both  the  ROF  and  GOF  models,  it  is  important  to  first  specify  what 
measurements  of  the  network  to  capture,  then  to  obtain  the  survivability  measures 
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using  procedures  summarized  in  Section  7.2.  To  clarify  the  application  of  the 
models,  more  discussion  on  their  measures  follows. 

73  GOF  Network  Survivability  Measures  In  the  GOF  model,  a  selected 
measurement,  say  x ,  of  the  network  can  be  quantified,  such  as  in  Figure  7.1,  failure 
of  x  occurs  at  time  to  and  the  following  survivability  attributes  can  be  defined: 

Sa  the  fraction  of  x  that  remains  after  failure  has  happened  and  before 
restoration  starts  (1-U), 

Su  the  fraction  of  x  unservable  after  failure  has  happened  L.nd  before 
restoration  starts  (U), 

Sr  the  fraction  of  x  that  is  restored  at  ti, 

to  the  time  the  failure  occurs, 

tr  the  duration  required  until  fraction  Sa+Sr  of  x  is  restored,  and 
tR  the  duration  required  until  all  of  x  (t2)  is  restored  (D). 

In  general,  Sa,  Su,  Sr,  tr  and  tR  are  random  variables  depending  on  network 
topology,  type  of  failure,  restoration  techniques,  etc. 


Figure  7.1:  Survivability  Attributes 

Using  the  three  main  survivability  attributes,  Sa,  Sr,  and  tr  defined  above,  the 
survivability  of  a  network  for  a  given  “measurement”  can  be  characterized.  In 
general,  these  attributes  are  random  variables,  and  each  one  has  a  probability 
(frequency  or  distribution)  function. 

A  number  of  different  quantities  of  interest  can  be  derived,  each  capturing  a 
particular  network  characteristic.  For  instance,  for  Sa,  Sr,  and  tr,  we  can  obtain  the 
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expected  values,  the  worst-case  values,  the  p-percentile  values,  and  the  probability 
of  zero.  Note  that  each  measure  captures  a  different  aspect  of  network 
survivability.  Two  examples  on  network  survivability  analysis  and  assessment 
based  on  the  GOF  model  are  given  in  Appendix  B. 

7.3.1  Service  Layer  Examples 


Service  Outage  Performance  Based  on  Traffic  Blocking 
Traffic  blocking  level  can  be  used  as  a  survivability  measurement  in  circuit 
switched  network  survivability  analysis.  In  this  case,  the  survivability  parameter 
can  be  the  “relative  average  traffic  blocking  level.”  Many  grades  of  availability 
could  be  defined,  in  terms  of  various  blocking  levels  and  the  duration  of  such 
blocking.  The  basic  two  states  are  defined  as  Acceptable  (no  service  outage)  and 
Unacceptable  (service  outage)  as  shown  in  Figure  7.2. 


f\Vt  Acceptable  State  BT  =  Blocking  Threshold  Time 

ifiil  Unacceptable  State  D  =  Outage  Duration 


Figure  7.2:  Two-State  Traffic  Model 

The  acceptable  state  can  be  defined  as  the  case  where  the  blocking  is  less  than 
some  specified  level,  Bj.  The  threshold  of  Bt  could  be  the  engineered  blocking, 
or  some  higher  level.  The  cause  of  blocking  may  be  network  failure  conditions 
(capacity  loss)  as  well  as  overload  situations  (demand  increase). 

The  unacceptable  state  can  be  defined  as  the  case  where  blocking  exceeds  the 
blocking  threshold  By.  The  Service  Outage  Time,  D  is  defined  to  be  the  interval 
between  the  point  where  blocking  exceeds  the  threshold  Bt,  and  the  first  point 
where  blocking  drops  below  Bj.  The  blocking  threshold  which  defines  the 
transition  between  the  two  states  can  be  determined  on  the  basis  of  specific  user 
group  requirements  and  objectives.  Although  a  two-state  model  is  defined  here, 
other  states  can  be  defined,  including  marginal  states  of  acceptability.  For 
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example,  a  network  manager  may  define  other  states  which  will  allow  more  refined 
actions. 

We  will  relate  the  blocking  survivability  parameter  to  the  network  failure  impact 
categories  in  Section  5.2.  Given  a  failure,  three  parameters,  U,  D  and  E  are  used  to 
classify  the  magnitude  of  the  service  failure.  Here,  U  is  the  load-weighted 
percentage  of  point-to-points  whose  blocking  exceeds  the  threshold  By,  D  is  the 
maximum  duration  for  which  all  point-to-points  included  in  that  percentage  exceed 
the  threshold,  and  E  is  the  set  of  point-to-points  considered.  For  example,  if  the 
combination  of  U,  D  and  E  is  within  the  proper  qualifying  region  as  defined  in 
Section  4,  then  a  catastrophic  service  outage  has  occurred  for  this  user  group. 

User  Lost  Erlang 

The  User  Lost  Erlang  (ULE)  measure  proposed  in  [33],  is  a  single  dimensional 
measure  of  service  outages.  Single  dimensional  measures  are  not  adequate  to 
capture  the  three  essential  parameters  of  an  outage  (unservability,  duration  and 
extent).  For  example,  under  the  ULE  measure  a  30  million  access  line 
Interexchange  network  with  a  loss  of  one  million  Erlangs  over  six  minutes  is 
equated  with  a  one  million  access  line  EC  (Exchange  Carrier)  network  with  a  loss 
of  ten  thousand  Erlangs  over  ten  hours. 

7.3.2  Logical  and  System  Layer  Examples 

Restoration  Time 

Given  a  network  failure,  the  restoration  time  is  the  time  from  loss  of  first  working 
signal  until  the  time  the  last  signal  is  restored  by  the  restoration  method.  Note  that 
depending  on  various  types  of  failures  (e.g.,  node  failure)  and  network 
configurations,  it  is  possible  that  not  all  lost  signals  can  be  restored.  Also, 
restoration  times  of  different  methods  must  be  compared  at  the  same  level  of 
restoration,  i.e.,  over  similar  networks  with  the  same  number  of  lost  and  restored 
signals. 

Consistent  with  Section  6,  restoration  time  can  be  generally  broken  into  the 
following  phases: 

1.  Detection  time:  the  time  from  loss  of  first  signal  until  the  restoration 
method  detects  the  loss  or  alarm. 

2.  Notification  time :  the  time  required  to  notify  the  control  architecture  of  the 
failure. 

3.  Identification  of  failed  signals. 

4.  Path  selection  time :  the  time  required  to  obtain  alternate  paths  for  lost 
signals  (by  data  look-up  or  algorithm). 
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5.  Rerouting  time:  the  time  required  to  complete  all  cross-connects  or 
switching  to  alternate  paths. 

Impacts  of  restoration  time  on  various  network  services  can  be  found  in 
Appendix  D. 

Verification  Time 

In  addition  to  restoration  time,  practical  implementation  of  most  restoration 
methods  will  verify  their  operation.  Verification  time  is  defined  to  be  the  time 
required  to  verify  successful  cross-connection  or  switching  by  an  operations 
system,  network  controller,  or  network  element.  In  some  methods,  the  verification 
process  starts  after  the  last  signal  is  restored,  while  in  others  it  occurs  in  parallel 
with  the  restoration  process. 

Verified  Restoration  Time 

While  the  verification  time  is  not  strictly  included  in  the  restoration  time  definition, 
its  implementation  can  vary  widely  across  restoration  methods  and  its  measure  can 
be  important  for  the  performance  evaluation  of  different  restoration  methods. 
Therefore,  a  combined  measure  is  defined  here.  Given  a  network  failure,  verified 
restoration  time  is  defined  to  be  the  time  from  the  loss  of  the  first  working  signal 
until  the  time  the  last  signal  is  restored  and  verified. 

Restoration  Ratio 

Given  a  set  of  simultaneous  network  failures,  the  restoration  ratio  is  x/y,  where  x  is 
the  number  of  signals  restored  by  the  restoration  method  and  y  is  the  number  of 
signals  that  were  lost  because  of  the  network  failure.  This  measure  is  used  to 
compare  two  similar  restoration  methods  on  equivalent  networks,  i.e.,  with  the 
same  nodes  and  links,  the  same  amount  of  working  signals,  the  same  amount  of 
spare  capacity,  and  the  same  amount  of  lost  signals,  x.  An  aggregate  restoration 
ratio  can  also  be  calculated  by  averaging  the  restoration  ratio  over  a  given  set  of 
non-simultaneous  failures  (e.g.,  all  possible  single  link  failures  or  all  single  node 
failures  or  both). 

Given  the  failure  set,  a  restoration  method  can  also  be  compared  against  the 
optimal  ratio,  i.e.,  the  maximum  restoration  ratio  for  a  given  network.  Note  that  in 
some  cases  the  maximum  ratio  may  be  intractable  to  compute.  For  example,  with 
DCS  reconfiguration  methods,  if  a  link  rerouting  method  is  used  (see  Section  6.3.1 
for  definitions),  the  theoretical  maximum  number  of  signals  that  can  be  restored  for 
any  single  link  failure  by  using  any  link  rerouting  method  is  the  solution  to  a 
maximum  “flow”  problem  (in  optimization  terminology)  between  the  two  end 
points  of  the  failed  link.  This  is,  generally,  an  easy  problem  to  solve.  However,  if 
point-to-point  rerouting  methods  are  used,  then  this  theoretical  maximum  number 
is  at  least  as  large  as  that  of  link  rerouting  methods,  but  computing  the  theoretical 
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maximum  number  of  signals  that  can  be  restored  by  using  point-to-point  rerouting 
methods  is,  in  general,  an  intractable  problem. 

Efficiency  Ratio 

Given  a  possible  set  of  non-simultaneous  failures  in  a  network  (e.g.,  all  possible 
single  link  failures),  the  efficiency  ratio  is  defined  to  be  how  much  spare  signal 
capacity  is  necessary  to  guarantee  maximum  restoration  over  all  possible  network 
failures,  expressed  as  a  ratio  =  number  of  spare  signals  /  number  of  working 
signals.  Note  that,  in  general,  for  node  failures,  signals  that  terminate/originate  at 
the  failed  node  cannot  be  restored,  and  hence  maximum  restoration  is  less  than 
100%. 

Given  the  possible  failure  set  in  a  network,  a  restoration  method  can  also  be 
compared  against  the  optimal  efficiency  ratio, l.e.,  the  minimum  theoretical 
efficiency  ratio  in  that  network  for  any  possible  restoration  method.  As  with  the 
restoration  ratio,  the  minimum  ratio  may  be  intractable  to  compute. 

7.4  ROF  Network  Survivability  Measures  In  contrast  to  the  GOF  model, 
the  ROF  model  makes  no  assumptions  that  a  given  failure  has  occurred  (i.e.,  ^  in 
Figure  7.1  is  random).  If  the  probability  distribution  of  time  between  failures  is 
exponential,  the  failure  process  could  be  fully  characterized  by  Mean  Time 
Between  Failure  (MTBF),  or  Mean  time  Between  Service  Outages  (MTBSO)  and 
the  Unservability  (o),  could  be  estimated  in  terms  of  Mean  Time  To  Repair 
(MTTR)  and  MTBSO.  Service  unservability,  based  on  traffic  blocking  is  an 
important  measurement  in  the  service  layer  and  is  discussed  next. 

7.4.1  Service  Layer  Examples 

7.4.1.1  Circuit  Switching 
Overall  Average  Blocking 

The  ROF  model  can  be  used  to  estimate  the  expected  Overall  Average  Blocking 
(OABavg)  when  enough  historical  data  on  the  long-term  variation  of  the  blocking 
level  is  available  and  an  acceptable  estimate  for  MTTSR  (Mean  Time  to  Service 
Restoration)  and  MTBSO  (Mean  Time  Between  Service  Outages)  can  be  obtained. 
It  can  be  shown  that  [34]: 

O  AB  avg  =  ( 1  -\))B  n  +  \)By 

where  Bn  and  Bu  are  the  blocking  in  acceptable  (normal)  and  unacceptable  states 
respectively.  The  Unservability  (o),  is  defined  as  the  long-term  ratio  of  outage 
time  to  scheduled  service  time.  Under  the  assumption  of  exponentially  distributed 
failure  and  restoration  times  we  have: 

MTTSR 

MTTSR  +  MTBSO 
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Annual  Loss  of  Traffic 

Annual  Loss  of  traffic  is  the  amount  of  traffic  that  the  network  is  expected  to  lose 
per  year  due  to  failures.  This  aspect  of  survivability  takes  into  account  the 
frequency  and  duration  of  failures  using  the  ROF  model. 

A  measure  for  the  annual  loss  of  traffic  is  Expected  Loss  of  Traffic  (ELT), 
measured  in  units  such  as  carried  load  (e.g.,  in  units  of  Erlangs)  per  year. 

7.4.1.2  Packet  Switching 

Packet  Delay  (Transport  Time) 

Survivability  of  packet  switching  networks  can  be  measured  using  packet  delay  or 
transport  time.  Others  measurements  are:  message  loss  probability,  undetected 
error  probability,  message  out  of  sequence  probability,  and  availability18.  Packet 
delay  can  be  computed  on  a  point-to-point  basis  and  averaged  over  time,  i.e.,  point- 
to-point  packets  from  A-Z  see  an  average  delay  over  a  given  interval  of  time. 
Aggregate  measures  are  computed  by  averaging  the  weighted  point-to-point  delays 
over  all  point-to-points,  where  the  weighting  is  the  proportional  relative  packet 
traffic  (load)  to  the  whole  traffic  in  the  network  of  interest. 

Failure  can  be  related  in  terms  of  Section  4,  by  analyzing  a  distribution  of  point-to- 
points  whose  delay  exceeds  a  given  threshold.  This  threshold  would  depend  on  the 
grade  of  service  objectives  for  the  particular  network.  An  example  of  an  aggregate 
unservability  measure  could  be  the  weighted  proportion  of  point-to-points  whose 
delay  exceeds  the  threshold,  e.g.,  50%  of  traffic  exceeds  the  threshold  (the  point- 
to-points  whose  delay  exceeds  the  threshold  constitute  50%  of  the  total  traffic  of 
the  network). 

7.4.13  Common  Channel  Signaling 

Performance  Evaluation 

Most  performance  characteristics  commonly  used  for  packet  networks  also  apply 
to  CCSNs.  Performance  of  CCSNs  can  be  evaluated  using  measurements  such  as 
availability  (or  downtime),  transport  time  (delay),  message  loss  probability, 
undetected  error  probability,  message-out-of-sequence  probability,  and  signaling 
link  bit  error  ratio  [21  ].  The  first  two  measurements  have  direct  impact  on 
planning  and  engineering  survivable  CCSNs,  and  are  discussed  below. 

Performance  measurements  such  as  delay  or  utilization  can  be  specified  as  peak 
(maximum)  over  a  given  time  interval,  or  mean  (average)  over  a  given  time 
interval,  or  else  are  relative  to  the  completion  of  a  given  type  of  call.  Whether 


18  CCITT  Recommendation  1.355  |47|  on  ISDN  availability  identifies  the  wide  range  of  ISDN  network 
performance  parameters  (circuit-switched  and  packet-switched)  and  their  thresholds  for  defining  unacceptable  states. 


51 


Technical  Report  No.  24 

peak  or  mean  delay/utilization  measures  are  appropriate,  and  how  long  the  time 
intervals  over  which  these  statistics  should  apply  need  to  be  determined  to  derive 
appropriate  CCSN  traffic  engineering  parameters.  For  more  detailed  information 
on  SS7  monitoring  and  measurements,  see  [48]. 

For  a  circuit  switched  network,  for  example,  traffic  engineering  is  based  on 
performance  objectives  such  as  maximum  busy-season  busy-hour  blocking.  More 
study  of  the  characteristics  of  CCS  traffic  which  results  from  voice,  ISDN  or  other 
services  (this  traffic  is  likely  to  be  more  bursty  than  traffic  in  a  circuit  switched 
network)  is  needed,  to  determine  the  relevant  time  scales  for  the  CCSN. 

Downtime  Objectives 

Downtime  objectives  are  intended  to  control  the  amount  of  time  a  CCSN  (or  a 
segment  thereof)  is  unable  to  perform  its  required  signaling  functions.  They  can  be 
represented  by  a  single  number  equal  to  the  long-term  percentage  of  time  a  CCSN 
—  or  segments  thereof  —  are  expected  to  be  “down.”  As  such,  downtime 
objectives  can  significantly  influence  end  user  perception  of  service  quality.  The 
expected  percentage  of  downtime  for  a  network  element  can  be  interpreted  either 
as: 

•  the  average  downtime  over  many  years  for  this  network  element,  or  as 

•  the  average  downtime  over  one  year  for  a  population  of  the  network 
elements. 

According  to  American  National  Standard  (ANS)  ANSI  Tl.l  1 1.6-1992,  Section 
5.1.2  [43],  the  MTP  downtime  objective  for  the  CCS  basic  mesh  network  shown  in 
Figure  7.3,  corresponds  to  (an  average  of)19  no  more  than  10  minutes  downtime  per 
year  for  the  signaling  paths  between  two  SEPs,  and  is  broken  down  as  follows  (see 
Section  6.4.3. 1): 

•  each  user  interface  segment  should  be  down  (an  average  of)19  no  more  than 
3  minutes  per  year 

•  each  network  access  segment  should  be  down  (an  average  of)19  no  more  than 
2  minutes  per  year,  and 

•  the  backbone  network  segment  should  be  down  a  negligible  amount  of  time 
(i.e.,  close  to  0  minutes  downtime  per  year).  Note  that  downtime  for  this 
segment  includes  failures  that  prevent  use  of  the  backbone  segment  but  do 
not  by  themselves  disable  any  other  segment(s). 


19  The  original  text  in  (43)  refers  to  "nominal  requirements"  that  arc  interpreted  in  this  Technical  Report  as  "average 
downtime  numbers." 
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Figure  7.3:  ANSI  (Tl.111.6)  Downtime  Objec*  s  for  CCS  Basic  Mesh 

Network  Segments  (MTP  only) 

To  meet  the  above  allocation,  the  ANS-based  (ANSI  Tl.l  1 1.5-1992,  Section  7.2.1) 
reference  architecture  uses  two-way  diversity  for  the  A-Iink  sets,  mated  pairs  of 
STPs,  and  three-way  diversity  for  the  B-/D-link  sets  [43].  If  three-way  diversity  is 
not  achieved  in  the  backbone  segment,  the  downtime  of  that  segment  may  not  be 
negligible.  Also,  if  mated  STP  pairs  are  not  deployed  in  the  CCS  architecture,  then 
the  downtime  objectives  of  the  network -access  segment  and  the  backbone  segment 
may  not  be  met.  Hence,  the  10-minute  end-to-end  downtime  objective  may  not  be 
achievable. 

When  the  SEPs  in  Figure  6.10  (Section  6.4.3. 1 )  are  both  CCSSOs,  the  10-minute 
end-to-end  downtime  objective  and  the  above  allocation  to  network  segments 
correspond  to  a  single  trunk  group  with  its  terminating  CCSSOs  interconnected 
using  the  ANS-based  reference  architecture  (see  Figure  7.3).  It  also  applies  when 
instead  of  CCSSOs  we  have  SSPs  or  SCPs. 

One  or  more  switches  homed  on  an  STP  pair  cannot  communicate  with  all  switches 
homed  on  another  STP  pair  (but  all  switches  on  any  one  STP  pair  can  still 
communicate  with  other  switches  homed  on  that  pair)  if  any  of  the  following 
occurs: 

•  the  entire  B-/D-link  set  quad  fails, 

•  one  of  the  STPs  in  either  mated  pair  and  the  B-/D-link  set  pair  of  the  other 
STP  in  the  mated  pair  fail,  or 
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•  one  or  more  A-link  set(s)  to  one  of  the  STPs  in  either  mated  pair  and  its  C- 
link  set  and  the  B-/D-link  set  pair  of  the  other  STP  in  the  mated  pair  fail. 

Delay  Objectives 

Delay  objectives  have  the  most  direct  impact  on  CCS  traffic  engineering.  Three 
major  types  of  message  transfer  delay  measures  (as  well  as  hybrids  of  these  types) 
are  usually  considered:  peak  (or  maximum),  percentile,  and  mean  (or  average).  A 
peak  (or  maximum)  delay  requires  that  messages  be  transmitted  within  a  given 
time  period.  A  percentile  delay  requires  that  a  certain  percentage  of  all  messages 
be  transmitted  within  a  given  time  period.  A  mean  (or  average)  delay  requires  that, 
on  the  average,  messages  be  transmitted  within  a  given  time  period.  Some 
measures  for  cross-STP  delays  are  provided  in  [23].  No  objectives  are  defined  in 
the  SS7  standards  for  either  link  set  delays  or  end-to-end  delays20. 

Even  when  end-to-end  delay  objectives  are  completely  defined,  allocating  them  to 
individual  cross-network  element  delays  is  difficult.  When  a  hybrid  delay 
objective  involving  the  peak,  percentile  and  mean  criteria  is  considered  as  in  the 
FCC  Docket  No.  86-10  case21,  finding  an  effective  allocation  policy  is  a  yet  more 
complex  difficult  task.  For  example,  jt  is  difficult  to  determine  how  the  5-second 
call  setup  delay  for  the  800  Service  mandated  by  FCC  Docket  No.  86-10  should  be 
apportioned  to  the  call  piece  parts:  POTS  connection  to  switch,  intra-SSP 
processing,  processing  in  network  access  segments  of  the  CCSN,  interconnection 
to  ICs,  access  to  SCPs,  etc.  Criteria/rules  are  needed  to  effectively  perform 
allocation  of  delay  objectives  among  individual  or  groups  of  CCS  network 
elements.  For  example,  one  may  want  to  increase  the  utilization  of  (thus  allocating 
higher  admissible  delays  to)  more  expensive  CCS  network  elements. 

Utilization  Objectives 

Signaling  link  set  utilization  is  defined  as  the  fraction  of  time  message  signal  unit 
(MSU)  packet  transmission  occurs  over  the  link  set.  It  can  be  computed  as  the 
ratio  of  the  link  set’s  carried  load  to  the  link  set's  capacity  (or  link  set  speed). 
Processor  utilization  can  be  computed  in  a  similar  fashion.  Low  utilization  implies 
higher  costs.  High  utilizations,  on  the  other  hand,  may  cause  unacceptable  delays. 
Even  if  the  network  element  is  engineered  for  a  low  utilization,  CCS  networic 
element  failures  might  cause  some  additional  signaling  traffic  to  be  diverted  to  the 
network  element,  thereby  increasing  (usually,  doubling  for  a  single  network 


20  Some  delay  objectives  have  been  mandated  by  regulatory  agencies.  For  example,  average  end-to-end  call  setup 
delays  under  5  seconds  for  the  800  Service  arc  required  to  comply  with  FCC  Docket  No.  86-10  [24]. 

21  FCC  Docket  No.  86-10  requires  that  (i)  97%  of  the  800  Service  traffic  involve  call  set-up  times  of  5  seconds  or 
less  by  March  4, 1993,  and  (ii)  100%  of  the  800  Service  traffic  involve  maximum  call  set-up  times  of  5  seconds  or 
less  and  average  call  set-up  times  of  2.5  seconds  or  less  by  March  4, 1995. 
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element  failure)  its  carried  load  and  its  utilization.  Utilization  upper  bounds  can  be 
used  to  ensure  both  congestion  control  and  survivability  at  reasonable  costs. 

As  an  example,  [23]  requires  that  a  signaling  link  set  load  be  limited  to  0.4  Erlangs 
per  link  under  normal  conditions  (i.e.,  utilization  objective  of  at  most  40%),  so  that 
if  a  failure  occurs,  an  expected  peak  load  of  at  most  0.8  Erlangs  of  CCS  traffic 
would  be  carried  by  the  surviving  signaling  link  set.  The  time  scale  involved  in 
this  utilization  objective  is  however  left  undefined  in  [23].  It  is  not  clear  whether 
hourly  averages  adequately  capture  the  variations  in  carried  loads,  or  whether 
averages  over  some  other  period  (e.g.,  five  minutes)  are  needed. 

Based  on  [23],  the  utilization  objective  guideline  of  40%  applies  to  A-,  B-,  D-  and 
E-link  sets,  which  are  then  engineered  to  not  exceed  40%  utilization  under  normal 
(no  failure)  conditions.  To  support  double  failures  in  the  backbone  network 
segments,  B-  and  D-link  sets  could  be  engineered  to  a  utilization  lower  than  40% 
under  normal  (no  failure)  conditions.  Currently,  there  are  no  standardized 
utilization  guidelines  for  either  C-link  sets  or  F-link  sets.  C-link  sets  carry  CCS 
traffic  only  in  case  of  a  combined  B-/D-link  set  or  A-/E-link  set  failure  (they  also 
carry  network  management  messages  between  STP  mates). 

Setting  adequate  utilization  objectives  ensures  survivability  in  the  event  of  a  single 
failure  in  the  network  access  segments,  or  a  single  failure  in  the  backbone  network 
segments,  or  a  double  failure  in  the  backbone  network  segments.  Based  on  [43],  in 
the  event  of  such  a  failure,  congestion  is  avoided  if  the  utilization  remains  below 
80%.  However,  a  backbone  network -segment  in  a  CCSN  engineered  according  to 
this  requirement  cannot  withstand  a  double  failure  such  as  the  simultaneous  failure 
of  an  STP  and  one  of  the  two  B-  or  D-link  sets  connecting  its  mate  to  the  other 
mated  pair  of  STPs.  Among  other  alternatives  [50]  to  lessen  vulnerability  to  this 
type  of  double  failure,  the  utilization  objective  for  B-/D-link  sets  could  be  reduced. 

Another  requirement  given  in  [23]  is  that  an  STP  should  be  able  to  handle  its 
mate's  traffic  load  in  addition  to  its  normal  (no  failure)  traffic  load.  No  similar 
requirement  for  CCS  nodal  network  elements  other  than  STPs  is  given  in  [23]. 

SCPs  are  candidates  for  similar  mating  requirements. 

7.4.2  Logical  and  System  Layer  Examples 

Annual  Loss  of  Traffic 

Annual  Loss  of  traffic  is  the  amount  of  traffic  that  the  network  is  expected  to  lose 
per  year  due  to  failures.  This  aspect  of  survivability  takes  into  account  the 
frequency  and  duration  of  failures  using  the  ROF  model.  A  good  measure  for  the 
annual  loss  of  traffic  is  Expected  Loss  of  Traffic  (ELT)  -  measured,  in  the  logical 
layer,  in  units  such  as  DS-1  or  DS-3  minutes  per  year. 
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To  compute  this  measure,  for  a  network  with  say  DS-3  demands,  consider  all  the 
pairs  of  nodes  in  the  network  for  which  zero  DS-3  demand  or  “traffic”  exists  (a 
DS-3  demand  between  node  A  and  node  Z  is  defined  to  be  a  contiguous 
transmission  signal  that  is  accessible  at  the  DS-3  rate  at  both  nodes  A  and  Z  and 
does  not  drop  below  the  DS-3  rate  between  A  and  Z).  For  example,  in  local 
exchange  networks  a  DS-3  signal  may  be  provided  to  carry  a  customer  demand  at 
the  DS-3  rate  or  it  may  be  provided  to  multiplex  or  groom  many  DS-1  signals  from 
various  origination  and  destination  pairs  for  transport  between  the  node  pair  of  the 
DS-3. 

If  the  network  of  interest  has  self-healing  capability,  each  DS-3  demand  would 
normally  be  routed  over  a  path  of  self-healing  systems/subnetworks.  Given  a  pair 
of  nodes,  { i,  j},  and  the  number  of  DS-3  demands  between  them,  dy,  let  { R ijk } 
denote  the  unique  paths  between  i  and  j  over  which  the  { i,  j }  DS-3  demands  route. 
Correspondingly,  let  dyk  denote  the  number  of  { i,  j }  DS-3  demands  that  route  over 

pathRijk  (£dyk=dij).  ELTy  is  defined  to  be  the  sum  over  k  of  the  product  of  dijk 
k 

and  the  expected  downtime  (in  minutes  per  year)  of  path  Ryk.  ELT,  the  quantity 
for  the  whole  network,  is  the  sum  of  ELTy  over  all  pairs  of  nodes  in  the  network 
for  which  non-zero  traffic  exists.  The  expected  downtime  for  each  path,  Rijk,  is 
calculated  by  taking  into  account  the  probability  and  average  duration  of  failures. 

Annual  Loss  of  Connectivity 

Annual  Loss  of  Connectivity  is  the  propensity  of  the  network  to  lose  all 
connectivity  between  a  pair  of  nodes  per  year.  Connection  between  two  nodes,  i 
and  j,  is  lost  if,  for  all  paths  Rijk,  there  exists  no  working  or  protection  channels 
able  to  carry  the  demand,  djj.  This  aspect  of  survivability  also  takes  into  account 
frequency  and  duration  of  failures  using  the  ROF  model. 

The  consequences  of  losing  connection  between  two  nodes  can  be  more  serious 
than  the  consequences  of  losing  an  equivalent  amount  of  traffic  throughout  the 
network  without  losing  connectivity.  Loss  of  connectivity  can  lead  to  the  loss  of 
important  emergency  and  high  priority  traffic  or  create  a  situation  of  isolation. 

Two  measures  can  be  defined  for  this  aspect  of  survivability,  the  Average  Expected 
Downtime  of  Connection  (AEDC),  measured  in  units  of  minutes  per  year,  and 
Probability  Distribution  of  Downtime  of  Connection  (PDDC). 

The  first  measure,  AEDC  provides  a  summary  of  the  loss  of  connectivity.  The 
second  measure,  PDDC,  allows  a  more  detailed  study  of  the  likelihood  of  incidents 
of  prolonged  loss  of  connection. 
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Given  a  pair  of  nodes,  {i,  j},  EDCy  is  defined  to  be  the  expected  value  of  the 
downtime  of  connection  between  nodes  i  and  j  measured  in  minutes  per  year.  For 
the  same  pair  of  nodes,  PDDQ  j  takes  the  form  of  a  curve.  On  the  horizontal  axis 
is  duration  T  in  minutes.  On  the  vertical  axis  is  the  probability  that  in  a  year  there 
will  be  at  least  one  incident  where  the  nodes  i  and  j  will  lose  connection  with  each 
other  for  T  minutes  or  longer. 

AEDC,  the  summary  quantity  for  the  whole  network,  is  the  arithmetic  average  of 
the  EDCij  over  all  pairs  of  nodes,  {i,j},  with  non-zero  traffic  in  the  network. 

7.5  Qualitative  Assessment  of  Network  Survivability  Techniques  Other 
considerations  beyond  the  GOF  and  ROF  techniques  are  often  important  to  assess 
network  survivability  techniques.  This  section  defines  qualitative  assessment 
criteria  for  network  survivability  techniques. 

Failure  Types 

This  is  defined  to  be  the  types  of  failures  for  which  the  network  survivability 
technique  can  restore  some  or  all  signals.  The  general  categories  of  failure  types 
include: 

•  single  link  failure  (i.e.,  single,  partial,  or  total  cable  failure), 

•  two  simultaneous  (or  near  simultaneous)  link  failures, 

•  three  or  more  simultaneous  (or  near  simultaneous)  link  failures, 

•  single  node  failure  (i.e.,  partial  or  total  failure  of  a  network  element),  and 

•  multiple  node  failure. 

These  categories  can  be  used  to  assess  the  use  of  network  survivability  techniques. 
Restoration  Signal  Rate 

This  is  defined  to  be  the  rate  or  level  at  which  the  method  can  restore  signals.  For 
example,  self-healing  rings  can  restore  signals  at  line  rates  (OC-48,  etc.)  or  “path” 
rates,  i.e.,  per  channel,  and  DCS  methods  can  restore  at  various  signal  rates  (e.g., 
DS-1,  DS-3,  STS-n). 


Stability 

Stability  is  defined  to  be  the  response  or  ability  to  predict  the  response  of  the 
restoration  method  to  variations  or  “perturbations”  in  the  network  parameters,  such 
as  type  of  failure,  speed  of  data  links,- speed  of  hardware,  database  inconsistencies, 
data  communication  errors,  or  size  of  network. 

Capacity  Limitation 

Capacity  limitation  is  defined  to  be  the  inherent  capacity  limitations  of  a  single 
“contiguous”  configuration  of  the  restoration  method.  For  example,  a  SONET  OC- 
48  self-healing  ring  can  accommodate  48  STS- 1 /DS-3  signals  that  add/drop  out  of 
the  ring  before  another  ring  is  needed  (each  ring  is  in  some  sense  “autonomous,” 
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although  some  types  of  dual  homing  rings  can  interact);  a  DCS  reconfiguration 
method  has  no  hard  limit  because  more  DCSs  can  always  be  added  to  the 
contiguous  self-healing  network. 

Ability  to  Inter-work 

Each  restoration  method  requires  particular  equipment  and  spare  signal  capacity  to 
function.  This  ability  to  inter-work  is  defined  as  the  ability  of  a  given  restoration 
method  to  inter-work  with  other  restoration  methods  by  sharing  the  same  signal 
capacity  or  equipment. 

Control  Mechanisms 

The  general  types  of  control  mechanisms  for  restoration  methods  are  distributed 
(done  by  the  network  elements  or  equipment)  and  centralized  (done  by  an 
operations  system  or  subnetwork  controller).  Each  restoration  method  can  go 
through  different  phases  of  execution,  as  broadly  outlined  in  the  beginning  of 
Section  6.  For  a  given  restoration  method,  the  control  mechanism  can  differ  across 
different  phases.  For  example,  a  DCS  reconfiguration  method  may  use  distributed 
control  to  find  alternate  paths,  yet  use  centralized  control  to  accomplish  the 
“normalization”  step. 

Implementation  Requirements 

Implementation  requirements  are  the  systems,  procedures,  and  equipment,  needed 
to  implement  and  maintain  the  restoration  method  in  the  network.  These  are 
categorized  below: 

•  interfaces  with  centralized  network  operations  support  systems, 

•  interactions  with  network  planning  personnel,  procedures,  and  systems, 

•  network  design  algorithms  and  methods  (for  example,  a  DCS 
reconfiguration  method  must  determine  the  placement  of  spare  signal 
capacity  in  the  network  to  guarantee  a  prescribed  restoration  ratio  over  a 
defined  possible  set  of  failures),  and 

•  software  or  hardware  implementing  the  restoration  method  (including  cost, 
size,  speed,  and  complexity). 

8.  Suggestions  to  General  Industry 

1.  Use  the  (U,  D,  E)  triple  to  quantify  and  categorize  service  outages.  This 
approach  is  more  comprehensive  than  a  single  dimensional  measure,  such 
as  the  ULE. 

2.  The  performance  measures,  definitions  and  terminology  used  in  this 
document  should  become  the  standard  when  discussing  network 
survivability,  reliability,  integrity  and  performance.  For  example,  they  can 
be  used  in  joint  IC  -  EC  network  survivability  studies. 
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3.  Use  the  framework  for  classifying  network  survivability  techniques  to 
classify  future  network  survivability  analyses  and  performance  techniques. 

9.  Recommendations  to  Standards  Organizations 

9.1  Recommendations  for  Committee  T1 

1.  Review  established  performance  objectives  (e.g.,  restoral  time  objectives 
for  SONET  rings,  utilization  objectives  for  CCS  network  elements)  to 
ensure  they  are  appropriate  for  future  applications  (e.g.,  Broadband  ISDN, 
or  B-ISDN). 

2.  Use  the  performance  measures  and  terminology  introduced  in  this  report  in 
related  standards  work. 

9.2  Recommendations  for  Future  T1A1.2  Work 

1.  Quantify  the  qualifying  regions  for  the  (U,  D,  E)  triple  for  service  outage 
categories.  Numerical  limits,  such  as  the  FCC's  existing  outage  reporting 
thresholds  should  be  studied  further  to  establish  their  validity  for  comparing 
network  service  outages. 

2.  Obtain  additional  user  expectations  on  network  survivability  performance  to 
meet  current  and  future  needs.  Analyze  impacts  of  service  outages  on  user 
categories. 

3.  Develop  planning,  engineering  and  implementation  guidelines  for  network 
survivability. 

10.  Summary 

This  Technical  Report  has  addressed  concerns  of  the  Telecommunications 
Community  about  the  issue  of  network  survivability.  This  report  provides  a 
common  approach  to  describe  and  assess  network  survivability.  Terminology  has 
been  introduced.  A  framework  for  quantifying  and  categorizing  service  outages 
and  a  framework  for  classifying  network  survivability  techniques  have  been  built. 
Performance  analysis  techniques  have  also  been  characterized.  Recommendations 
have  been  included  for  the  interpretation  and  use  of  this  report  for  continuing 
industry  activities. 
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12.  Definitions 

These  definitions  refer  to  network  survivability  and  may  be  somewhat  different 

from  those  of  other  documents. 


Alternate  Route 
Call  Attempt 
Connectivity 


End  User 

Fault  Tolerance 

Grade  of  Service 
(Traffic) 


A  second  or  subsequent  choice  path  between  two  points. 

Any  attempt  to  set  up  a  connection  for  a  call. 

Node  (or  link)  connectivity:  minimum  number  of  nodes 
(or  links)  whose  removal  results  in  losing  all  paths  that 
can  be  used  to  transfer  information  from  a  source  to  a 
sink. 

Those  who  use  telecommunications  services;  i.e.,  those 
who  either  originate  or  terminate  telecommunications. 

The  ability  of  a  network  element  or  system  to  continue 
to  function  under  component  failure(s). 

The  proportion  of  calls,  usually  during  the  busy  hour, 
that  cannot  be  completed  due  to  limits  in  the  call- 
handling  capability  of  a  component  in  a  network.  For 
example,  service  objectives  are  defined  on  a  per-link 
(per-trunk  group)  basis  for  the  last-choice  groups  in  a 
traffic  network.  Any  performance  objective  or  measure 
for  the  network.  For  example,  the  probability  of  a  call 
being  blocked  or  delayed  more  than  a  specified  interval, 
expressed  as  a  decimal  fraction. 
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Layer 


Maintainability 


Network  Availability 
Network  Failure 


Network  Failure 
Triple 

Network  Integrity 

Network 

Performance 

Network  Reliability 

Network  Restoration 

Network 

Survivability 


Network 

Survivability  Model 


One  of  the  four  network  survivability  layers  into  which 
network  functionality  is  decomposed  (as  described  in 
Section  5).  Note,  unless  otherwise  stated,  these  layers 
are  not  the  same  as  those  of  the  ISO  OSI  seven  layer 
model  [25]. 

The  ability  of  an  item  under  stated  conditions  of  use,  to 
be  retained  in,  or  restored  to,  a  state  in  which  it  can 
perform  a  required  function,  when  maintenance  is 
performed  under  given  conditions,  and  using  stated 
procedures  and  resources. 

The  probability  a  network  can  perform  its  required 
functions. 

A  complete  or  partial  failure  of  a  component  or 
components  of  a  network  because  of  malfunction  or 
natural  or  human-caused  disasters.  Partial  failures 
include  degradation. 

A  combination  of  the  network’s  Unservability  (U), 
Duration  (D)  and  Extent  (E)  parameters  exceeds  a 
threshold. 

See  Network  Survivability. 

The  level  at  which  a  network  fulfills  its  function. 

See  Network  Survivability. 

Automatic  of  manual  methods  to  return  a  network  to  its 
normal  function  in  response  to  a  network  failure. 

Network  survivability  is:  (i)  the  ability  of  a  network  to 
maintain  or  restore  an  acceptable  level  of  performance 
during  network  failures  by  applying  various  restoration 
techniques,  and  (ii)  the  mitigation  or  prevention  of 
service  outages  from  network  failures  by  applying 
preventative  techniques. 

The  analytical  processes  defined  in  Section  7.2  for 
estimating  how  well  network  services  will  be  impacted 
and  restored  with  transparencies  to  the  users  as  a  result 
of  a  failure. 
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Service  Outage  The  state  of  a  service  when  network  failure(s)  impair  the 

initiation  of  new  requests  for  service  and/or  the 
continued  use  of  the  service  and  where  the  service 
outage  parameters  (U,  D,  E)  do  not  fall  into  the  “no 
outage”  qualifying  region. 

SONET  North  American  Standard  For  Synchronous  Fiber  Digital 

Transmission  [38,44]. 

13.  Abbreviations  and  Acronyms 


ABS 

Alternate  Billing  System 

ADM 

Add-Drop  Multiplexer 

AEDC 

Average  Expected  Downtime  of  Connection 

AIS 

Alarm  Indication  Signal 

APS 

Automatic  Protection  Switching 

AT 

Access  Tandem 

ATM 

Asynchronous  Transfer  Mode 

CCS 

Common  Channel  Signaling,  or 

Centa  (Hundred)  Call  Seconds 

CCSN 

Common  Channel  Signaling  Network 

CCSSO 

CCS  Switching  Office 

CO 

Central  Office 

DCS 

Digital  Cross-connect  System 

EC 

Exchange  Carrier  (local) 

EO 

End  Office 

FDDI 

Fiber  Distributed  Data  Interface 

GOF 

Given  Occurrence  of  Failure 

GTT 

Global  Title  Translation 

IC 

Interexchange  Carrier 

ISDN 

Integrated  Services  Digital  Network 

ISO 

International  Standards  Organization 

MSU 

Message  Signaling  Unit 

MTBF 

Mean  Time  Between  Failure 

MTBSO 

Mean  Time  Between  Service  Outages 

MTP 

Message  Transfer  Part  (SS7) 

MTSO 

Mobile  Telephone  Switching  Office 

MTTR 

Mean  Time  To  Repair 

OA&M 

Operations,  Administration  and  Maintenance 

OMAP 

Operations,  Maintenance  and  Administration  Part 

OSI 

Open  Systems  Interconnection 

OSS 

Operations  Support  System 

PCN 

Personal  Communications  Network 
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P  D  DC 

Probability  Distribution  of  Downtime  of  Connection 

ROF 

Random  Occurrence  of  Failure 

see  P 

Signaling  Connection  Control  Part 

SCP 

Service  Control  Point 

SEP 

Signaling  End  Point 

SHN 

Self  Healing  Network 

SHR 

Self  Healing  Ring 

SONET 

Synchronous  Optical  NET work 

SPE 

Synchronous  Payload  Envelope  (SONET) 

SS7 

Signaling  System  Number  7 

SSP 

Service  Switching  Point 

STE 

Span  Terminating  Equipment 

STP 

Signaling  Transfer  Point 

TCAP 

Transaction  Capabilities  Application  Part 

TSI 

Time  Slot  Interchange 

ULE 

User  Lost  Erlang 

VT 

Virtual  Tributary  (SONET) 

Appendix  A.  Telecommunications  Service  Priority  System 

Background: 

Since  1967,  a  system  has  been  in  existence  for  prioritizing  the  restoration  of 
commercially  supplied  telecommunications  for  the  U.S.  Government.  This  system 
has  been  known  as  the  Restoration  Priority  (RP)  System. 

Restoration  priorities  have  been  assigned  to  leased,  intercity,  private  line  circuits 
that  are  deemed  critical  to  the  National  Security  Emergency  Preparedness  (NSEP) 
posture  of  the  nation.  The  assignment  of  these  priorities  has  been  administered  by 
the  National  Communications  System  (NCS),  in  Arlington,  Va.,  for  the  FCC 
(Federal  Communications  Commissi' 

In  1985,  following  the  divestiture  of  A  i  ^T,  the  NCS  undertook  to  revamp  the  RP 
System.  A  joint  effort  by  the  telecommunications  industry  and  the  U.S. 
Government,  under  the  auspices  of  the  National  Security  Telecommunications 
Advisory  Committee  (NSTAC),  a  Presidential  advisory  body,  developed  the 
Telecommunications  Service  Priority  (TSP)  System. 

TSP  Rule: 

On  Dec.  1, 1989,  the  FCC  issued  Declaratory  Ruling  DA  98-1524  that  set  the 
effective  date  of  the  beginning  of  the  TSP  System  to  be  Sept.  10, 1990.  This  date 
is  referred  to  as  the  Initial  Operating  Capability  (IOC)  date  by  the  government. 
This  IOC  date  also  marks  the  end  of  RP-type  code  assignments  and  the  beginning 
of  a  transition  period,  during  which  all  end  users  will  be  required  to  submit  their 
current  circuits  with  RP  codes  for  recertification  under  the  new  TSP  criteria.  All 
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existing  RP  circuit  designations  will  remain  in  effect  until  March  1993  or  until 
recertification  by  the  user  under  the  TSP  criteria. 

The  TSP  System  officially  replaces  the  RP  System  under  Part  64,  Appendix  A,  of 
the  FCC  Rules  and  Regulations,  Chapter  1  of  Title  47  of  the  CFR. 

TSP  Documents: 

The  supplemental  documents  that  address  the  total  TSP  System  include  the  TSP 
Service  Vendor  Handbook  (NCSH  3-1-2),  the  TSP  Service  User  Manual  (NCSM 
3-1-1)  and  NCS  Directive  3-1  that  covers  the  government’s  participation  in  the 
system. 

TSP  Requirements : 

The  TSP  System  requires  ail  common  carriers  to  provision  new  services  and 
restore  designated  services  on  a  priority  basis.  Since  the  FCC  has  approved  and 
codified  the  TSP  System,  common  carriers  may  now  provide  preferential  treatment 
to  specially  designated  user  requests  for  service. 

Priority  levels  will  be  assigned  to  NSEP  telecommunications  services  that  will 
specify  the  order  in  which  provisioning  or  restoration  of  the  services  is  to  occur 
relative  to  other  NSEP  and  non-NSEP  services.  The  authorized  priority  levels  are 
designated  as  follows,  highest  to  lowest:  E  (Emergency),  1,  2,  3, 4  and  5  for 
provisioning  and  1 , 2,  3, 4  and  5  for  restoration.  The  responsibility  of  any  common 
carrier  is  to: 

a.  Provision  NSEP  telecommunications  services  before  non-NSEP  service 
requests,  when  NSEP  treatment  is  properly  invoked  and  a  12-character,  TSP 
Authorization  Code  is  provided  by  the  user/contract  officer.  Allocate 
resources  to  ensure  the  company’s  best  efforts  to  provide  NSEP  services 
according  to  and  in  order  of  the  provisioning  priority. 

b.  Restore  NSEP  telecommunications  services  before  non-NSEP  services  and 
in  order  of  priority  level.  Allocate  available  resources  to  restore  as  quickly 
as  practical.  Broadband  or  multiple  service  facilities  restoration  is  permitted 
even  though  it  might  result  in  restoration  of  services  with  no  priority  or 
lower  priority  level. 

Impacts  of  TSP: 

The  following  is  a  list  of  the  impacts  of  TSP: 
a.  Provisioning: 

•  The  customer  must  provide  a  TSP  Authorization  Code  to  receive 
preferential  treatment. 

•  A  TSP  Provisioning  Priority  must  be  assigned  as  the  eleventh  digit  of  the 
Authorization  Code  to  receive  accelerated  service  installation. 
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Provisioning  priorities  are  either  EMERGENCY  (designated  by  an  “E”)  or 
ESSENTIAL  (designated  by  a  1 , 2, 3, 4  or  5).  The  FCC  Order  requires 
that  service  suppliers  provision  Emergency  services  before  any  Essential 
TSP  service  or  non-TSP  service. 

•  All  common  carriers  must  provide  a  24  hour  point-of-contact  to  receive 
properly -invoked  provisioning  requests. 

b.  Restoration: 

•  Network  Operations  must  dispatch  service  personnel  outside  normal 
business  hours  to  restore  services 

-  assigned  priority  1 , 2  or  3, 

-  assigned  4  or  5,  if  the  next  business  day  is  more  than  24  hours  away. 

•  Operations  may  preempt  non-TSP  or  lower  priority  TSP  circuits  to  restore 
higher  priority  TSP  circuits. 

c.  Administrative: 

•  The  TSP  System  allows  state  and  local  governments,  together  with  private 
entities,  to  request  TSP  priority  assignments  through  the  use  of  Federal 
sponsorship. 

•  Industry  must  provide  a  separate  notice  to  the  NCS  upon  completion  of  a 
service  installation,  modification  or  cancellation  in  addition  to  any 
notification  which  is  supplied  directly  to  the  customer/contract  officer. 

•  All  TSP  records  must  be  reconciled  with  the  NCS  annually  and  every 
three  years  with  subcontractors. 

•  The  TSP  System  provides  the  basis  (via  tariff  or  contract)  for  carriers  to 
charge  for  record  management,  accelerated  provisioning  and  restoration 
services. 

Appendix  B.  Example  Network  Survivability  Analyses  and  Assessments 

Appendix  B.l  Bidirectional  Ring  Survivability  Example  As  an  example  at 
the  GOF  model,  consider  a  bidirectional  ring  network  with  N=6  nodes  connecting 
to  a  central  node  as  shown  in  Figure  B.l .  Here  Sa  is  the  fraction  of  remaining 
nodes  connected  to  a  central  node  (CN)  under  a  disaster  causing  Nu  nodes  to  be 
destroyed.  Depending  on  which  nodes  are  failed,  the  value  of  Sa  may  be  different. 


69 


Technical  Report  No.  24 


Figure  B.l:  Bidirectional  Ring  Example 

It  is  clear  that  Nu  is  a  random  variable  with  values  Nu=0,...,N.  In  this  example, 
the  number  of  failed  nodes  and  the  number  of  ways  Nu-2  can  occur  is  tabulated  in 
Table  B.l.  The  set  of  nodes  connected  to  the  central  node,  Na=0,...,N-Nu,  is  also 
shown.  Using  this  information,  the  fraction  of  nodes  connected  given  two  failed 
nodes  can  be  easily  obtained. 

Assuming  the  two-node  failure  case  (all  equally  probable),  there  are  15  ways  of 
choosing  the  failed  pair.  For  five  failed  pairs,  the  number  of  connected  nodes  is 
four,  and  for  four  failed  pairs,  the  number  of  connected  nodes  is  three,  and  so  forth. 
The  probability  frequency  function  PfSa=y]  is  shown  in  Figure  B.2  for  Nu=2. 
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. 15 . 
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Table  B.l:  Failure  Categories 
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Figure  B.2:  Probability  Frequency  Function  for  Nu=2 

The  probabilistic  nature  of  Sa  is  well  demonstrated  in  Figure  B.2.  For  the 
bidirectional  ring  network  shown  in  Figure  B.l,  when  two  nodes  are  destroyed, 
there  is  a  33%  probability  that  66%  of  the  nodes  are  connected  to  the  CN,  and  there 
is  about  a  6%  chance  that  no  nodes  are  connected  to  the  CN. 

Appendix  B.2  Network  Survivability  Assessment  Example  In  an  8-node,  7- 
link,  fixed  route  network,  shown  in  Figure  B.3,  total  Dt  =  56  point-to-point  DS-3 

demands  are  distributed  and  routed  over  the  links  of  the  network  prior  to  failure,  as 
indicated  in  Table  B.2.  There  are  Ct  =  10  demand  pairs  in  this  network. 


Figure  B.3:  8-node,  7-link  network  Example 

For  simplicity,  we  assume  a  sample  space  of  a  single  link  failure  and  that  the  set  of 
states  after  failure  is  limited  to  { x  j } ,  where  event  xj  denotes  the  failure  of  link  i. 
We  want  to  assess  the  survivability  of  this  network,  in  terms  of  three  survivability 
measures,  (i)  average  fraction  of  traffic  (DS-3s)  that  remains  after  failure  (or 


71 


Technical  Report  No.  24 

Average  Traffic  Survivability  Ratio),  Sa,  (ii)  average  fraction  of  subscribers  that 
remain  connected  after  failure  (or  Average  Connectivity  Ratio),  Ca  and  (iii) 
Average  Time  for  single  link  Restoration,  t^. 


Node  Pair 

Demand 

(DS-3s) 

COl - C03 

5 

1-2 

COl  -  C06 

11 

1-2-4-6-7 

COl - C08 

8 

1-2-3 

COl - C04 

4 

1-2-4 

COl  -  C07 

5 

1-2-5 

C06  -  C08 

6 

7-64-3 

C06 - C07 

4 

7-64-5 

C05  -  C08 

6 

64-3 

C05  -  C07 

3 

64-5 

C04  -  C08 

4 

4-3 

Table  B.2:  Point-to-point  demand  and  routing  table  for  Example  B.2 

By  considering  the  network  routing  plan,  aggregated  load,  Lj,  on  the  i*h  link  can  be 
found.  If  a  given  link  fails,  Lj  DS-3s  on  that  link  would  be  lost,  or 

Lj  =  sum  of  all  demands  going  through  link  i 

and  we  have: 

Sa(xj)  =  (Dt "  M)/Dt 

Where  Sa(xj)  is  the  Traffic  Survivability  Ratio,  assuming  the  i^1  link  has  failed. 

For  example  for  link  1 ,  L  j  =33  DS-3s  and  if  this  link  fails  33  DS-3s  out  of  56 
DS-3s  would  fail  and  only  23  DS-3s  would  survive.  Thus  Sa(x  j)=  23/56  =  0.41 . 
The  value  of  Sa(xj)  for  all  links  (i=l  ,2,...,7)  are  tabulated  in  Table  B.3. 

As  indicated  in  Table  B.2,  there  are  Ct  =  1 0  point-to-point  “connections”  in  this 
network.  Each  link  carries  nj  of  these  connections  and  we  have 

Ca(x  j)  =  (Ct  -  ni)  /  Ct 

where  Ca(xj)  is  the  Connectivity  Ratio,  assuming  the  ilh  link  has  failed.  For 
example  n  j  =  5  connections  go  through  link  1 .  Therefore  if  link  1  fails,  half  of  all 
network  connections  would  be  disconnected,  or  Ca(x  j)  =  5/10  =  0.5.  The  values  of 
Ca(xi),  i=l,2,..,7  are  also  tabulated  in  Table  B.3. 


72 


H 


Technical  Report  No.  24 

Similar  to  example  B.l ,  we  can  now  attach  a  probability  to  each  event,  x  j.  If  we 
assume  that  each  event,  x  j,  has  equal  probability,  the  expected  values  of  the 
random  variables  are 


E(Sa]  =  l/K£Sa(xiX 
i 

E[Caj  =  l/KXCa(xi), 
i 


E|tRl  =  1/KXVCXi). 
i 

where  K  =  total  number  of  links  (K  =  7  in  this  example).  tr(xj)  is  the  estimated 

Restoration  Time  for  the  ith  link  using  normal  restoration  methods  and  can  be 
obtained  using  long  term  link  outage  data.  Note  that  ^.(Xj)  is  a  function  of  the 

length  of  link  i.  Using  Table  B.3,  we  have: 

E(Sa)  =  0.51 ,  E(Ca)  =  0.54  and  E(tR)  =  70.7  minutes 


Link 

No.(i) 

Lgi 

(DS-3s) 

nci 

Sa(Xj) 

Mxi) 

(minutes) 

1 

33 

5 

0.41 

15 

2 

33 

5 

0.41 

160 

3 

24 

4 

0.55 

0.6 

55 

4 

38 

7 

0.32 

0.3 

45 

5 

12 

3 

0.78 

0.7 

100 

6 

30 

5 

0.46 

0.5 

35 

7 

21 

3 

0.62 

0.7 

85 

Table  B.3:  Link  information  for  Example  B.2 
Appendix  C.  User  Expectations 

This  section  discusses  the  expectations  of  certain  users  or  user  groups  with  respect 
to  service  continuity  and  survivability. 

Appendix  C.l  U.S.  Government  User  Expectations  Government  agencies 
have  diverse  and  unique  user  expectations  such  as  Telecommunications  Service 
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Priority22  (TSP),  security,  interoperability,  compatibility,  redundancy,  diversity, 
hardening,  prevention  of  loss  or  corruption  of  critical  data,  measurable 
Reliability/Maintainability/Availability  (RMA)  performance  parameters23,  and 
responsiveness.  For  example,  the  U.S.  Federal  Aviation  Administration  (FAA)  and 
Department  of  Defense  (DoD)  define  and  require  several  types  and  scopes  of 
communications  services  and  are  concerned  about  the  causes  of  probable  network 
failures  shown  in  Table  C.l .  The  users  of  the  service  types  defined  in  Table  C.l 
depend  upon  these  availability  and  restoration  time  requirements. 


Service 

Types 

Service  Scope 

Failure  Causes 

Expected 

Availability 

Expected 
Restoration  Time 

Critical 

Functions  or  services 
which,  if  lost,  would 
Drevent  the 
network/  system 
from  exercising  safe 
operation  and  control 
for  the  end  users. 

Earthquakes,  floods, 
hurricanes,  tornadoes, 
acts  of  war, 
hardware/software 
failures. 

electromagnetic  pulse, 
etc. 

0.9999987 

<  6  seconds 

Essential 

Functions  or  services 
which,  if  lost,  would 
reduce  the  caoabilitv 
of  the  network/system 
to  exercise  safe 
operation  and  control 
for  the  end  users. 

Terrorism,  fires, 
power/battery  failures, 
etc. 

0.999 

approximately 

1 0  minutes 

Routine 

Functions  or  services 
which,  if  lost,  would 
not  significantly 
degrade  the  capability 
of  the  network/system 
to  exercise  safe 
operation  and  control 
for  the  end  users. 

Node  and  link  failures, 
subsystem  failures, 
equipment  failures,  etc. 

0.99 

approximately 

2  hours 

Table  C.l:  Government  Communications  Services24 


22  See  Appendix  A  for  details  on  TSP. 

23  Definitions  of  terms  used  here  and  in  government  acquisitions  arc  provided  in  this  report  and  by  U.S.  Federal 
Standard  1037B.  Telecommunications  Glossary. 

24  The  expected  availability  values  recorded  here  require  information  as  to  the  population  and  interval  over  which 
they  have  been  calculated. 
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In  national  and  international  emergencies,  the  government  expects 
networks/systems  to  be  capable  of  providing  survivable  communications  for 
national  decision  makers,  executing  crisis  management  control,  offering  distributed 
control  of  information,  and  restoring  networks  and  systems.  To  provide  for  these 
expectations,  networks/systems  should,  at  a  minimum,  have  the  following 
survivability  features  and  capabilities: 

Connectivity:  The  DoD  requires  connectivity  to  provide  the  capability  of 
connecting  several  networks.  In  order  to  react  to  rapidly  changing  environments 
and  to  be  able  to  easily  and  quickly  reconfigure  networks,  the  ability  to 
communicate  facility  status  to  network  management  centers  must  be  provided. 

Security:  Systems  that  provide  telecommunications  services  must  be  made  secure 
from  both  internal  and  external  threats  such  as  unauthorized  access,  sabotage, 
hackers,  and  terrorism.  Survivability  must  address  and  counter  these  threats 
through  the  use  of  security  services  and  mechanisms  such  as:  identification, 
authentication,  and  authorization  of  users,  operators,  and  maintainers  of  the  system; 
access  control;  database  integrity;  and  encryption  of  network  control  messages. 
Multi-level  security  must  not  be  impacted  by  network  outages. 

Endurability  :  The  communications  network  must  be  capable  of  operating  using 
commercial  power,  uninterruptable  power,  and  back-up  battery  systems  during 
failures.  In  addition  to  a  network's  capability  to  withstand  hardship,  stress,  and 
adversity  during  given  periods  of  time  and  levels  of  network  performance. 

Network  designs  should  also  include  counter  measures  for  survivability  against 
threats  of  nuclear  and  High-Altitude  Electro-Magnetic  Pulses  (HEMP). 

Interoperability:  The  communications  network  must  be  capable  of  operating  with 
all  agencies  and  commercial  providers. 

Compatibility':  The  communications  network  must  have  the  ability  to  coexist  and 
not  interfere  with  the  operations  of  other  networks/systems. 

Survivability  Performance  Parameters:  Numerical  values  for  these  parameters 
should  be  provided  because  they  are  relevant  to  the  network  management  of 
government  long-haul  and  base-level  networks.  These  parameters  are  a  function  of 
the  following  factors: 

•  the  type  of  switching  and  transmission  technology  to  be  employed  in  the 
network  under  consideration  (i.e.,  SONET,  ISDN,  ATM  Switching,  FDDI), 

•  the  robustness  of  the  network  management  technology  used  to  control  these 
network  resources, 

•  network  topology,  and 
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•  the  required  level  of  network  throughput  efficiency,  based  on  economic 
constraints. 

Appendix  D.  Tolerance  Categories  for  Restoration  Times 

To  design  networks  to  meet  user  expectations,  information  is  needed  about  the 
impact  of  failures  on  network  services.  This  information  can  be  used  in  standards 
activities  for  determining  restoration  times  for  network  facilities  as  well  as  by 
network  planners  to  deploy  appropriate  network  survivability  techniques  for  each 
category  of  customer.  One  of  the  most  crucial  parameters  for  determining  the 
impact  of  a  service  outage  is  the  outage  duration  (the  outage  duration  here  is 
defined  to  be  the  time  interval  between  first  loss  of  a  particular  unit  of  usage  until 
that  unit  of  service  is  fully  restored).  Note  that  actual  service  outage  time  will 
often  exceed  facility  restoration  time.  More  detail  on  this  topic  can  be  found  in 
[35]25. 

Different  services  have  different  outage  duration  tolerances.  For  example, 
voiceband  call  tolerances  (i.e.,  the  interval  in  which  calls  in  progress  are 
abandoned)  can  vary  anywhere  from  150  milliseconds  (see  Table  D.2)  to  two 
seconds  while  data  (packet)  session  timeout  can  vary  from  two  to  300  seconds. 
Today,  session  dependent  applications  such  as  file  transfer  commonly  use  System 
Network  Architecture  (SNA)  and  TCP/IP  protocol  architectures.  With  SNA,  in 
particular,  a  session  dependent  application  has  a  software  programmable  session 
time-out  of  from  1 .1  to  255  seconds  (36]  and  can  be  specified  by  users  based  on 
file  size,  for  example.  Lost  data  from  shorter  interruptions  is  retransmitted,  a 
process  which  in  some  applications,  is  triggered  by  receiver  time-outs.  Receivers 
may  begin  to  time-out  based  on  twice  the  round-trip  delay. 

X.25  packet  networks  often  incorporate  idle  channel  state  condition  timers  that  are 
settable  from  one  to  30  seconds,  in  one  second  increments  (with  a  suggested  time 
of  5  seconds)  [45].  When  links  are  lost  between  switches,  these  timers  may  expire, 
and  if  they  do,  they  will  trigger  the  disconnection  of  all  virtual  calls  that  were  up  on 
those  links.  The  customers  must  then  restart  their  sessions.  It  is  common  (but  not 
required)  in  these  networks  to  feature  a  virtual  circuit  reconnect  capability,  through 
alternate  routing  around  the  failed  links,  but  only  after  the  call  is  reestablished. 

In  Asynchronous  Transfer  Mode  (ATM)  networks  (e.g.,  those  supporting  Switched 
Multi-megabit  Digital  Service  (SMDS)  [37]),  normal  cell  routing  processes  are 
interrupted  and  rerouting  processes  may  be  started  after  a  certain  interval.  This 
interval  can  be  estimated  at  200  milliseconds  based  on  SMDS  objectives  in  [37]. 


25  A  list  of  references  is  in  Section  1 1 . 
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The  objective  in  SMDS  supporting  networks  for  a  switch/router  to  update  its 
routing  tables  upon  notification  of  a  link  set  change  through  lower-layer  OS1 
protocols  is  less  than  100  milliseconds.  It  is  reasonable  to  expect  that  within  200 
milliseconds  from  the  onset  of  the  link  outage,  that  the  rerouting  process  should 
have  been  started,  allowing  time  for  topology  updating  information  to  propagate  to 
all  switches  (50  milliseconds),  and  delay  (50  milliseconds)  for  waiting  on  physical 
level  protection,  if  any  (e.g.,  point-to-point  systems  using  Automatic  Protection 
Switching).  However,  in  large  or  more  complex  networks  this  interval  may  be 
longer  than  200  milliseconds.  The  recovery  of  any  lost  data  is  handled  through 
higher  layer  OS  I  data  protocols. 

Figure  D.l  shows  the  impact  on  various  network  services  as  a  function  of 
restoration  times  for  SONET-based  architectures  due  to  a  network  failure.  Outage 
durations  include  restoration  times  at  the  SONET  level.  The  restoration  times  are 
classified  into  intervals  as  illustrated  in  this  figure.  Restoration  times  within  a 
given  interval  will  have  roughly  the  equivalent  impact  on  services.  Restoration 
times  under  50  milliseconds  will  be  transparent  to  most  services.  These  time 
intervals  are  tabulated  in  Table  D.l. 


Figure  D.l:  Restoration  Time  Impact  on  Customers 
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3 
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~  1 0  seconds 

4 
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5 

~5  minutes* 

~30  minutes26 

6 

~30  minutes26 

unspecified 

*These  restoration  time  intervals  and  their  impacts  should  be  studied. 

Table  D.I:  Restoration  Time  Intervals 


Restoration  times  below  50  milliseconds  will  meet  network  protection  switching 
time  requirements  at  the  SONET  level  [38,44].  Recovery  within  this  time  frame 
will  be  without  interruption  of  service  (a  service  “hit”).  A  “hit”  is  a  temporary 
interruption  of  service  that  causes  only  a  reframing  of  a  distant  terminal  (e.g., 
digital  channel  bank)  off  of  the  SONET  backbone.  A  “hit”  does  not  cause  a 
transmission  failure  alarm  to  be  issued  at  the  distant  terminal. 


For  a  circuit  switched  network,  the  minimum  disconnect  timing  interval  for  a 
possible  false  disconnect  to  occur  at  a  downstream  switch  trunk  interface  is  150 
milliseconds  [39]  (i.e.,  150  milliseconds  on-hook  supervisory  signal  state).  A 
failure  in  the  interoffice  trunk  network  may  cause  the  supervisory  signal  state  to 
appear  as  on-  hook,  rather  than  off-hook,  to  the  switch.  The  network  allocation  of 
this  150  milliseconds  is  as  follows: 


+  10  milliseconds 
+  40  milliseconds 
+  50  milliseconds 
Sub-total  100  milliseconds 

+  50  milliseconds 
Total  150  milliseconds 


failure  detection  time 
reframing  of  lower-rate  multiplexes 
reframing  of  distant  terminal 

switching  time  (SONET  level) 


Table  D.2:  Minimum  Trunk  Disconnect  Time 

The  time  frame  between  50  milliseconds  to  just  under  200  milliseconds  will  have 
minimal  impact  to  services.  Any  affected  voiceband  switched  calls  (voice  or  data) 
will  have  less  than  a  5%  probability  of  being  dropped  (related  to  signaling  freezing 
limitations  with  in-band  signaling).  Keeping  the  restoration  time  to  under  200 
milliseconds  will  prevent  any  affected  voiceband  circuit-switched  calls  that  were 


26  This  value  is  based  on  FCC  outage  reporting  requirements  and  should  be  reviewed. 
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on  trunks  associated  with  older  channel  banks  from  being  dropped.  Note  that  a  200 
millisecond  restoration  time  appears  as  a  300  millisecond  outage  to  a  channel  bank 
(when  including  detection  time  and  reframe  times).  Old  channel  bank  types  (e.g., 
D1  A,  DID,  D2,  D3)  exhibit  carrier  group  alarm  times  of  300  milliseconds  to  500 
milliseconds.  In  this  same  time  frame,  the  normal  cell  routing  process  in  cell-relay 
networks  (e.g.,  supporting  SMDS)  should  have  been  interrupted,  and  the  cell 
rerouting  process  started.  Again,  this  estimated  200  millisecond  interval  for  cell- 
relay  networks  may  be  longer  in  large  or  more  complex  networks. 

The  CCSN  will  provide  better  performance  by  virtue  of  being  out-of-band  and 
separate  from  the  path  of  the  call.  The  CCSN  uses  DS-0  circuits  for  signaling  links 
that  are  bunched  together  on  a  DS-1  trunk  (carrier).  Calls  only  in  the  process  of 
being  set-up  may  be  lost  with  an  outage  on  the  signaling  links;  established  calls 
should  not  be  affected.  Note  that  according  to  SS7  specifications  [22],  a  signaling 
link  is  considered  failed  and  taken  out  of  service  when  a  loss  of  alignment  (i.e.,  no 
signaling  flags)  lasts  for  approximately  146  milliseconds  at  the  STF.  A  changeover 
to  the  alternate  signaling  link  is  then  initiated.  A  restoration  time  in  the  time  frame 
between  50  milliseconds  to  just  under  200  milliseconds  will  trigger  this 
changeover. 

The  second  time  frame  is  between  200  milliseconds  to  just  under  two  seconds. 
Performance  of  the  voice  service  is  only  slightly  degraded  in  this  range  because 
only  affected  voiceband  circuit-switched  calls  that  are  on  trunks  associated  with 
old  channel  bank  types  will  be  dropped.  Network  data  from  one  local  exchange 
provider  indicates  that  roughly  1 2%  of  DS-0  circuits  that  are  carried  on  channel 
banks  (including  DS-Os  that  carry  trunks  of  circuit  switched  networks  as  well  as 
non-switched  circuits)  are  in  fact  carried  on  older  type  channel  banks.  Thus,  the 
total  number  of  trunks  in  circuit  switched  networks  that  are  carried  on  older  type 
channel  banks  is  less  than  12%  because  significant  numbers  of  DS-0  trunks  do  not 
route  over  channel  banks  (for  example,  DS-0  trunks  that  directly  interface  into 
digital  switches  at  the  DS-1  rate).  Video  services  can  become  degraded  in  just  1/30 
second  (loss  of  one  frame).  Furthermore,  restoration  times  greater  than  100 
milliseconds  have  been  shown  to  have  significant  impact  on  the  video  codec 
reframe  process  and  can  become  customer  reportable  impairments  [46]. 
Nevertheless,  an  actual  outage  of  less  than  two  seconds  is  not  critical  for 
entertainment/educational  type  video  services  (one  to  two  seconds  is  close  to 
human  reaction  time).  This  observation  applies  to  video  services  with  no  protocol 
for  recovery  of  lost  frames. 

The  third  time  frame  is  between  two  seconds  to  just  under  ten  seconds.  The  ten 
second  upper  bound  is  consistent  with  the  SONET  carrier  group  alarm 
requirements  [38,44].  When  outages  exceed  two  seconds,  DS-0,  nxDS-0,  and  DS-1 
call-dropping  will  occur.  Private  line  disconnects  will  also  occur.  Voiceband  data 
modems  will  time-out  (typically  two  to  three  seconds)  after  detecting  a  loss  of 
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incoming  data  carrier.  The  modems  must  then  be  resynchronized  (usually  a  15 
second  procedure)  after  the  data  carrier  is  restored. 

B-ISDN  calls  in  the  future  may  have  alarm  thresholds  set  higher  than  two  seconds 
(e.g.,  five  seconds),  but  B-ISDN  calls  would  still  be  dropped  in  this  time  range, 
assuming  the  outage  lasts  longer  than  five  seconds.  In  this  time  range  there  is  a 
potential  for  packet  (X.25)  virtual  calls  to  be  dropped  depending  on  the  idle  link 
timer  system  setting.  Disruption  of  data  communications  sessions  (connection 
oriented,  e.g.,  SNA)  may  occur  in  this  range  depending  on  the  session  termination 
time.  High-speed,  interactive  applications  (for  example,  video  image)  may  be 
impacted  in  this  time  range  since  these  applications  are  sensitive  to  delay.  This 
time  range  is  also  unattractive  for  high-priority  data  applications,  for  example, 
transfer  of  funds  using  automated  teller  machines.  Automated  teller  machines  use 
private  data  communications  networks  based  mostly  on  SNA  and  TCP/IP  protocol 
architectures.  Services  that  are  less  sensitive  to  delay  such  as  electronic  mail  (e- 
mail)  should  not  be  greatly  impacted.  Frame  relay  data  services  may  be  sensitive 
to  service  outages  because  of  their  high-speed  and  lack  of  ability  of  the  frame  relay 
network  layer  protocol  (roughly  equivalent  to  the  third  layer  of  the  OSI  stack)  to 
recover  lost  data. 

The  fourth  time  frame  is  between  ten  seconds  to  just  under  five  minutes27.  In  this 
fourth  time  range,  packet  (X.25)  calls  and  data  communications  sessions  will  be 
disrupted  and  customers  would  attempt  to  reinitialize  their  connections.  The  five 
minute  upper  bound  is  not  meant  to  be  a  firm  number  based  on  particular  software 
or  hardware  specifications,  but  rather  an  estimate  based  on  general  planning  goals 
for  various  ECs.  For  example,  the  five  minute  restoration  time  is  the  goal  for 
completing  service  restoration  by  some  network  management  systems  and  is 
considered  responsive  to  most  customers.  Also,  the  five  minute  upper  bound 
covers  the  maximum  SNA  session  timeout  of  255  seconds. 

In  the  fifth  time  frame,  digital  switches  in  switched  networks  may  experience  a 
buildup  of  network  congestion.  There  may  be  minor  social  and  business  impacts 
with  this  outage  event.  The  sixth  time  frame  is  based  on  the  30-minute  FCC 
requirement  on  carriers  that  major  outages  (affecting  at  least  30,000  or  50,000 
customers28  and  exceeding  30  minutes)  be  reportable. 


27  Based  on  industry  tradition  and  management  systems.  This  value  should  be  validated. 

28  The  FCC  required  5(),(XX)  lines  {51 ,521,  with  30,(XX)  lines  on  a  trial  basis  [53).  The  unanimous  recommendation 
of  the  Threshold  Reporting  focus  team  of  the  NRC  on  December  15,  1992  was  30.000  lines. 
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